# Two sides of Theoretical Data Science: Analysis and Synthesis

This discussion is motivated by a simple question (that was posed to me by a theoretical computer scientist): “What is the role of Statistics in the development of Theoretical Data Science?” The answer lies in understanding the big picture.

Theory of [Efficient] Computing: A branch of Theoretical Computer Science that deals with how quickly one can solve (compute) a given problem. The critical task is to carefully analyze algorithms based on their performance characteristics in order to make them computationally efficient.

Theory of Unified Algorithms: An emerging branch of Theoretical Statistics that deals with how efficiently one can represent a large class of diverse algorithms using a single unified semantics. The critical task is to put together different “mini-algorithms” into a coherent master algorithm.

For the overall development of Data Science, we need both ANALYSIS + SYNTHESIS. However, it is also important to bear in mind the distinction between the two.

# Confirmatory Culture: Time To Reform or Conform?

The confirmatory culture is deeply rooted within Statistics.

THEORY

Culture 1: Algorithm + Theory: the role of theory is to justify or confirm.

Culture 2: Theory + Algorithm: From confirmatory to constructive theory, explaining the statistical origin of the algorithm(s), i.e., where they came from. Culture 2 views “Algorithms” as the derived product, not the fundamental starting point [this point of view separates statistical science from machine learning].

PRACTICE

Culture 1: Science + Data: The job of a statistician is to confirm scientific guesses, and thus to happily play in everyone’s backyard as a confirmatist.

Culture 2: Data + Science: An exploratory nonparametric attitude. The statistician plays in the front yard as a key player, guiding scientists to ask the “right question.”

TEACHING

Culture 1: It proceeds in the following sequence:

```
for (i in 1:B) {
  Teach Algorithm-i;
  Teach Inference-i;
  Teach Computation-i;
}
```

By construction, it requires extensive bookkeeping and memorization of a long list of disconnected algorithms.

Culture 2: The pedagogical efforts emphasize the underlying fundamental principles and statistical logic whose consequences are algorithms. This “short-cut” approach substantially accelerates the learning by making it less mechanical and intimidating.

Should we continue to conform to the confirmatory culture, or is it time to reform? The choice is ours, and the consequences are ours as well.

# Data Scientist and Data Mechanic

Beware of the “Kaggle Syndrome.” Refuse to jump on the bandwagon of the brainless task of combining random algorithms to win a competition. In any case, this will make NO impact (other than 15 minutes of fame), as happened with the Netflix competition.

As datasets are getting BIG and COMPLEX, the most difficult challenge for the statistical scientist is to figure out “where the information is hidden.” It’s an interactive process of investigation rather than a passive application of algorithms and calculation of error rates. Two critical skills: (1) “look at the data,” which is missing in the mechanical push-the-button culture; and (2) learn “how to question the data,” rather than only answering a specific question. Together they allow data scientists to discover the unexpected in addition to the usual verification of the expected.
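As a small illustration of skill (1), summary statistics alone can hide the shape of a distribution. The sketch below (plain NumPy, with a hypothetical helper name of my own) contrasts two samples whose means nearly agree but whose shapes are entirely different:

```python
import numpy as np

def five_number_summary(x):
    """A first 'look at the data': min, quartiles, and max."""
    q = np.percentile(x, [0, 25, 50, 75, 100])
    return dict(zip(["min", "q1", "median", "q3", "max"], q))

rng = np.random.default_rng(0)
symmetric = rng.normal(loc=1.0, size=10_000)      # bell-shaped, mean ~1
skewed = rng.exponential(scale=1.0, size=10_000)  # heavily right-skewed, mean ~1

# The means nearly agree, but the summaries reveal different shapes:
# the skewed sample's median sits well below its mean.
print(round(np.mean(symmetric), 2), round(np.mean(skewed), 2))
print(five_number_summary(symmetric))
print(five_number_summary(skewed))
```

A model chosen by staring at the mean alone would treat these two samples identically; looking at the full summary (or, better, a plot) immediately suggests different questions to ask of each.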

This raises the question of whether

• the Data Science training curriculum should look like a long manual of specialized methods and (a series of cookbook) algorithms;
• or whether it should train students (and industry professionals) in Scientific Data Exploration (Sci-Dx): a systematic and pragmatic approach to data modeling that addresses the “Monkey and banana problem” [Pigeon’s approach] for practitioners. [I believe Wolfgang Köhler’s “insight learning” idea can guide us in developing such a curriculum.]

The first path will produce DataRobots, not Data Scientists. The latter goal looks out of reach unless we figure out how to design the “LEGO Bricks” of Statistical Science (the fundamental building blocks of statistical learning), which help us understand disparate statistical procedures from a common perspective (thus reducing the size of the manual) and can be appropriately combined to build versatile data products, brick by brick.
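To make the “LEGO Bricks” metaphor concrete, here is a minimal sketch (the brick names and `compose` helper are my own illustrative inventions, not an established API): each brick is a small transformation with a uniform array-in/array-out interface, so bricks snap together into larger procedures.

```python
import numpy as np

# Each "brick" shares one interface: array in, array out.
def winsorize(x, p=0.05):
    """Clip extreme values at the p and 1-p quantiles (a robustness brick)."""
    lo, hi = np.percentile(x, [100 * p, 100 * (1 - p)])
    return np.clip(x, lo, hi)

def center(x):
    return x - np.mean(x)

def scale(x):
    return x / np.std(x)

def compose(*bricks):
    """Snap bricks together, left to right, into one procedure."""
    def pipeline(x):
        for brick in bricks:
            x = brick(x)
        return x
    return pipeline

# A robust standardizer, built brick by brick rather than as a monolith.
robust_standardize = compose(winsorize, center, scale)
z = robust_standardize(np.array([1.0, 2.0, 3.0, 4.0, 100.0]))
```

Because every brick obeys the same interface, the manual shrinks: one composition rule replaces a separate recipe for each combination of preprocessing steps.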

# The Scientific Core of Data Analysis

My observation is motivated by Richard Courant’s view:

> However, the difficulty that challenges the inventive skill of the applied mathematician is to find suitable coordinate functions.

He also noted that

> If these functions are chosen without proper regard for the individuality of the problem the task of computation will become hopeless.

This leads me to the following conjecture: an efficient nonparametric data transformation or representation scheme is the basis of almost all successful learning algorithms (the Scientific Core of Data Analysis), and it should be emphasized in the research, teaching, and practice of 21st-century Statistical Science in order to develop a systematic and unified theory of data analysis (the foundation of data science).
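One concrete candidate for such a representation scheme is the empirical probability-integral (rank) transform, which re-expresses any sample, whatever its original distribution, in a common distribution-free coordinate system on (0, 1). The sketch below is my own illustration of the idea in plain NumPy, not a method from the text:

```python
import numpy as np

def ecdf_transform(x):
    """Map each observation to its normalized rank, i.e. evaluate the
    sample's own empirical CDF at each point (mid-rank convention)."""
    ranks = np.argsort(np.argsort(x))  # 0, 1, ..., n-1 in value order
    return (ranks + 0.5) / len(x)

rng = np.random.default_rng(1)
heavy_tailed = rng.standard_cauchy(5_000)  # no finite mean, wild outliers

u = ecdf_transform(heavy_tailed)
# Whatever the input distribution, u is an exact permutation of the
# uniform grid (0.5/n, 1.5/n, ..., (n-0.5)/n) on (0, 1).
```

In Courant’s language, the transformed values play the role of “suitable coordinate functions”: downstream procedures built on them inherit distribution-free behavior, no matter how wild the raw data.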

# Data Then Science, Next-Generation Statisticians

Data Science: The word “Data” precedes the word “Science,” signifying a dramatic change in modeling attitude in the 21st century.

• Statistical validation of scientific guesses → scientific validation of Statistical findings.
• Parametric confirmatory modeling (Science + Data) → nonparametric exploratory modeling (Data + Science).

Question: How many such nonparametric exploratory modeling tools (not inferential tools!) have we developed in the last three decades?

# The Unsolved Problem of Statistics: The BIG Question

From a statistical analysis point of view, data can be classified into the following classes:

• Data Type: discrete, continuous, and mixed (a combination of discrete and continuous data).
• Data Structure: univariate, multivariate, time series, spatial, image, graph/network, etc. (roughly in order of increasing complexity).
• Data Pattern: linear/non-linear, stationary/non-stationary, etc.
• Data Size: small, medium, and big (though this may be vague, as today’s big data are tomorrow’s small data).

There is a long history of developing core statistical modeling principles that are valid for any combination of the above list. A few examples include bivariate continuous regression (Francis Galton), multivariate discrete data (Karl Pearson, Udny Yule, Leo Goodman), mixed data (Thomas Bayes, Student, R. A. Fisher, Fix and Hodges), time series (Norbert Wiener, Box & Jenkins, Emanuel Parzen, John Tukey, David Brillinger), non-stationarity (Clive Granger, Robert Engle), and non-linearity (Grace Wahba, Cleveland).

To tackle these rich varieties of data, many cultures of statistical science have been developed over the last century, which can be broadly classified as (1) Parametric confirmatory; (2) Nonparametric exploratory and (3) Optimization-driven Algorithmic approaches.

United Statistical Algorithm. I claim that what we need is a breakthrough: a “Periodic Table of Data Science.” Developing new algorithms in an isolated manner will not be enough to justify “learning from data” as a proper scientific endeavor. We have to put some order (by understanding their internal statistical structure) into the current inventory of algorithms, which are mushrooming at a staggering rate these days. The underlying unity, “how they relate to each other,” will dictate what the Fundamental Principles of Data Science are. At a more practical level, this will enable data scientists to predict new algorithms in a systematic way rather than by trial and error.

Theory of Data Analysis: How can we develop such a consistent and unified framework of data analysis (the foundation of data science) that would reveal the interconnectedness among different branches of statistics? This remains one of the most vexing mysteries of modern Statistics. However, developing such a theory (leading to progressive unification of fundamental statistical learning tools) can have enormous implications for theory and practice of data analysis.

An example of a modern data challenge: Is there any fundamental universal modeling principle for tackling these large varieties of data? This is still an unsolved problem, especially difficult to solve programmatically. Big Data experts have started to recognize that this is a real challenging “Research problem! Killing most CIO’s” and that “if there is any Achilles’ heel it’s going to be this.” Check out this recent talk [PDF, Video (from 1:32:00 – 1:38:00)] by Turing laureate Michael Stonebraker @ the White House Office of Science & Technology Policy and MIT, March 3, 2014, and also this and this one. I conjecture that Statistics can play a crucial role in solving this BIG-Variety problem.