From a statistical analysis point of view, data can be classified along the following dimensions:
- Data Type: discrete, continuous and mixed (combination of discrete and continuous data).
- Data Structure: univariate, multivariate, time series, spatial, image, graph/network, etc. (roughly in order of increasing complexity)
- Data Pattern: linear/non-linear, stationary/non-stationary, etc.
- Data Size: small, medium and big (though this may be vague as today’s big data are tomorrow’s small data)
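To make the first dimension concrete, here is a minimal Python sketch that labels a dataset as discrete, continuous or mixed. It is an illustration only: the function names `column_kind` and `data_type`, and the `max_levels` cutoff of 10 distinct values, are assumptions made for this example and are not part of any established classification rule.

```python
from typing import Mapping, Sequence


def column_kind(values: Sequence[object], max_levels: int = 10) -> str:
    """Label one variable as 'discrete' or 'continuous'.

    Heuristic (an assumption for illustration): a variable is treated as
    discrete if it holds any non-numeric value or has at most `max_levels`
    distinct values; otherwise it is treated as continuous.
    """
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if not all_numeric or len(set(values)) <= max_levels:
        return "discrete"
    return "continuous"


def data_type(columns: Mapping[str, Sequence[object]], max_levels: int = 10) -> str:
    """Label a whole dataset as 'discrete', 'continuous' or 'mixed'."""
    if not columns:
        raise ValueError("at least one column is required")
    kinds = {column_kind(values, max_levels) for values in columns.values()}
    # More than one kind of variable present means the data are mixed.
    return "mixed" if len(kinds) > 1 else kinds.pop()


if __name__ == "__main__":
    dataset = {
        "age": [20.5 + i for i in range(50)],  # many distinct numeric values
        "smoker": ["yes", "no"] * 25,          # two categorical levels
    }
    print(data_type(dataset))
```

The threshold-based rule is deliberately crude; in practice the distinction between discrete and continuous usually depends on how the variable was measured, not only on how many distinct values appear in a sample.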
There is a long history of developing core statistical modeling principles for various combinations of the categories listed above. A few examples include bivariate continuous regression (Francis Galton), multivariate discrete data (Karl Pearson, Udny Yule, Leo Goodman), mixed data (Thomas Bayes, Student, R.A. Fisher, Fix and Hodges), time series (Norbert Wiener, Box & Jenkins, Emanuel Parzen, John Tukey, David Brillinger), non-stationarity (Clive Granger, Robert Engle) and non-linearity (Grace Wahba, Cleveland).
To tackle this rich variety of data, many cultures of statistical science have developed over the last century. They can be broadly classified as (1) parametric confirmatory, (2) nonparametric exploratory and (3) optimization-driven algorithmic approaches.
United Statistical Algorithm. I claim that what we need is a breakthrough: a “Periodic Table of Data Science.” Developing new algorithms in isolation will not be enough to justify “learning from data” as a proper scientific endeavor. We have to bring some order, by understanding their internal statistical structure, to the current inventory of algorithms, which is growing at a staggering rate. The underlying unity in “how they relate to each other” will dictate what the Fundamental Principles of Data Science are. At a more practical level, this will enable data scientists to predict new algorithms in a systematic way rather than by trial and error.
Theory of Data Analysis: How can we develop a consistent and unified framework of data analysis (the foundation of data science) that would reveal the interconnectedness among the different branches of statistics? This remains one of the most vexing mysteries of modern statistics. Developing such a theory, leading to a progressive unification of the fundamental statistical learning tools, could have enormous implications for the theory and practice of data analysis.
An example of a modern data challenge: Is there any fundamental, universal modeling principle for tackling this large variety of data? This is still an unsolved problem, and it is especially difficult to solve programmatically. Big Data experts have started to recognize that this is a really challenging “Research problem! Killing most CIO’s” and that “if there is any achilles heel it’s going to be this.” Check out this recent talk [PDF, Video (from 1:32:00 – 1:38:00)] by Turing laureate Michael Stonebraker at the White House Office of Science & Technology Policy and MIT, March 3, 2014. I conjecture that statistics can play a crucial role in solving this BIG-Variety problem.