
The “Science” and “Management” of Data Analysis

Hierarchy and branches of Statistical Science

The phrases “Science” and “Management” of data analysis were introduced by Manny Parzen (2001) in his discussion of Leo Breiman’s paper “Statistical Modeling: The Two Cultures,” where he pointed out:

Management seeks profit, practical answers (predictions) useful for decision making in the short run. Science seeks truth, fundamental knowledge about nature which provides understanding and control in the long run.

Management = Algorithm: prediction and inference, undoubtedly the most useful and “sexy” part of Statistics. Over the past two decades, tremendous advances have been made on this front, leading to a growing body of literature and excellent textbooks such as Hastie, Tibshirani, and Friedman (2009) and, more recently, Efron and Hastie (2016).

Nevertheless, we can surely all agree that algorithms do not arise in a vacuum, and our job as statistical scientists should amount to more than finding another “gut” algorithm. It has long been observed that elegant statistical learning methods can often be derived from something more fundamental. This forces us to think about the guiding principles for designing (wholesale) algorithms. The “Science” of data analysis = an algorithm discovery engine (the Algorithm of Algorithms). Finding such a consistent framework of Statistical Science (from which one might systematically derive a wide range of working algorithms) promises to be anything but trivial.


Above all, I strongly believe the time has come to switch our focus from “management” to the heart of the matter: how can we create an inclusive and coherent framework of data analysis (one that accelerates the innovation of new, versatile algorithms and encodes the fundamental laws of numbers), a framework where there is “A place for everything, and everything in its place”? In this (difficult yet rewarding) journey, we have to constantly remind ourselves of the enlightening advice of Murray Gell-Mann (2005):

We have to get rid of the idea that careful study of a problem in some NARROW range of issues is the only kind of work to be taken seriously, while INTEGRATIVE thinking is relegated to cocktail party conversation.

 

Confirmatory Culture: Time To Reform or Conform?

Confirmatory culture is deeply rooted within Statistics.

THEORY

Culture 1: Algorithm + Theory: the role of theory is to justify or confirm.

Culture 2: Theory + Algorithm: From confirmatory to constructive theory, explaining the statistical origin of the algorithm(s), that is, where they came from. Culture 2 views “Algorithms” as the derived product, not the fundamental starting point [this point of view separates statistical science from machine learning].

PRACTICE 

Culture 1: Science + Data: The job of the statistician is to confirm scientific guesses, and thus to happily play in everyone’s backyard as a confirmatist.

Culture 2: Data + Science: An exploratory, nonparametric attitude. The statistician plays in the front yard as a key player, guiding scientists to ask the “right question.”

TEACHING 

Culture 1: Teaching proceeds in the following sequence:

for (i in 1:B) {          # loop over a long list of B disconnected topics
  Teach(Algorithm[i])     # present the i-th algorithm ...
  Teach(Inference[i])     # ... then its inference theory ...
  Teach(Computation[i])   # ... and finally its computational recipe
}

By construction, it requires extensive bookkeeping and memorization of a long list of disconnected algorithms.

Culture 2: The pedagogical effort emphasizes the underlying fundamental principles and statistical logic whose consequences are algorithms. This “shortcut” approach substantially accelerates learning by making it less mechanical and intimidating.

Should we continue to conform to the confirmatory culture, or is it time to reform? The choice is ours, and so are the consequences.

The Scientific Core of Data Analysis

My observation is motivated by Richard Courant’s view:

However, the difficulty that challenges the inventive skill of the applied mathematician is to find suitable coordinate functions.

He also noted that

If these functions are chosen without proper regard for the individuality of the problem the task of computation will become hopeless.

This leads me to the following conjecture: an efficient nonparametric data transformation or representation scheme is the basis of almost all successful learning algorithms. This is the Scientific Core of Data Analysis, and it should be emphasized in the research, teaching, and practice of 21st-century Statistical Science in order to develop a systematic and unified theory of data analysis (a foundation of data science).
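
To make this conjecture concrete, here is a minimal sketch in R (my own illustration, not from the original post) in which a single nonparametric representation, the rank/empirical-CDF transform, serves as the common first step behind several classical tools:

set.seed(1)
x <- rexp(100)                    # skewed predictor
y <- x^2 + rnorm(100, sd = 0.5)   # nonlinearly related response

# The "coordinate functions": rank (empirical CDF) transforms of x and y
u <- rank(x) / (length(x) + 1)
v <- rank(y) / (length(y) + 1)

# Spearman's correlation is simply Pearson's correlation computed on the
# rank-transformed coordinates
all.equal(cor(x, y, method = "spearman"), cor(u, v))   # TRUE

# The same transformed coordinates can feed a rank test, a copula-style
# regression, and so on: different algorithms, one shared representation.

The point of the sketch is not the particular transform but the pattern: once the representation is chosen well, many “different” algorithms fall out as routine consequences.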

Two Kinds of Mathematical Statisticians: Connectionist and Confirmatist

In the field of Statistical research, Mathematicians or Theoreticians come in two very distinct flavors:

Connectionist: Mathematicians who invent and connect novel algorithms based on new fundamental ideas that address real data modeling problems.

Confirmatist: Mathematicians who prove why an existing algorithm works under certain sets of assumptions/conditions (post-mortem report).

Theoreticians of the first kind (a few examples: Karl Pearson, Jerzy Neyman, Harold Hotelling, Charles Stein, Emanuel Parzen, Clive Granger) are much rarer than those of the second. The current culture has failed to distinguish between these two types (which differ greatly in style and motivation) and has put excessive importance on the second one; this has created an imbalance and often gives a wrong impression of what “Theory” means. We need new theoretical tools that not only prove why already-invented algorithms work (confirmatory check) but also provide insight into how to invent and connect novel algorithms for effective data analysis: 21st-century statistics.

Impact: The way I see it

Quantifying impact is a difficult task. However, to me, it is governed by a simple equation:

Theoretical beauty × Practical utility = Impact of your work.

  • By Theoretical beauty, I mean the unifying capacity of a concept or idea (not proofs of consistency or rates of convergence).
  • Practical utility denotes the generic usefulness of the algorithm, i.e., that it is simultaneously applicable to many problems (a wholesale algorithm), not just the writing of R packages and code.
  • The goal is to ensure that neither factor on the left-hand side of the equation is close to ZERO; balancing the two to maximize impact is an art.

Models of Convenience to Useful Models

I enjoyed reading this article by Mark van der Laan, which has a number of noteworthy aspects. I feel it’s an excellent just-in-time reminder, which rightly demands a change in perspective: “We have to start respecting, celebrating, and teaching important theoretical statistical contributions that precisely define the identity of our field.” The real question is: which are those topics?

Answer: ask which statistical concepts and tools are routinely used by non-statistician data scientists for their data-driven discovery. How many of them were discovered in the last three decades (and compare that with the number of so-called “top journal” papers published every month!)? Are we moving in the right direction? Isn’t it obvious why “our field has been nearly invisible in key arenas, especially in the ongoing discourse on Big Data and data science” (Davidian 2013)? Selling the same thing under a new name is not going to help (in either research or teaching); we need to invent and recognize new ideas that are beautiful and useful.

I totally agree with what he said: “Historically, data analysis was the job of a statistician, but, due to the lack of rigor that has developed in our field, I fear our representation in data science is becoming marginalized.” I believe the first step is to go beyond the currently fashionable plug-and-play style of model building; let’s make it an Interactive and Iterative (and thus more enjoyable) process based on a few fundamental and unified rules. Another way of saying the same thing is, “the smartest thing on the planet is neither man nor machine – it’s the combination of the two” [George Lee].

He refers to the famous quote “All models are wrong, but some are useful,” and he also expresses the concern that “Due to this, models that are so unrealistic that they are indexed by a finite dimensional parameter are still the status quo, even though everybody agrees they are known to be false.”

To me the important question is: can we systematically discover the useful models rather than starting from a guess based solely on convenience? Convenience typically comes in two flavors: theoretical and computational. (Classical) theoreticians like to stay in the perpetual fantasy world of “optimality,” whereas the (present-day) computational goal is to make things “faster” by hook or by crook.

It seems to me that the ultimate goal is to devise a “Nonparametric procedure to Discover Parametric models” (the Principle of NDP), yielding models that are simple and better than “models of convenience.” Do we have any systematic modeling strategy for that? [An example]
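
As a toy illustration of what such an NDP-style workflow might look like (my own sketch using standard R tools, not the example linked above): start with a nonparametric estimate, let it suggest a parametric family, and then confirm the reduction.

set.seed(7)
z <- rgamma(200, shape = 2, rate = 1)   # pretend these are raw data

# Step 1 (nonparametric look): a model-free density estimate; its shape
# (positive support, right skew) narrows down the candidate families
plot(density(z))

# Step 2 (discovery): fit a few candidate parametric families by maximum
# likelihood and compare them, e.g., via AIC
library(MASS)
fits <- list(gamma     = fitdistr(z, "gamma"),
             lognormal = fitdistr(z, "lognormal"),
             normal    = fitdistr(z, "normal"))
sapply(fits, AIC)   # the gamma family should win for these data

# Step 3 (confirmation): a goodness-of-fit check of the chosen parametric
# reduction (ignoring, for this sketch, the effect of estimated parameters
# on the p-value)
ks.test(z, "pgamma",
        shape = fits$gamma$estimate["shape"],
        rate  = fits$gamma$estimate["rate"])

This is, of course, only convenience-driven model selection dressed up in three steps; the open question is whether a principled, systematic version of Step 2 exists.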

 

As van der Laan puts it: “Stop working on toy problems, stop talking down theory, stop being attached to outdated statistical methods, stop worrying about the politics of our journals and our field. Be a true and proud statistician who is making an impact on the real world of Big Data. The world of data science needs us—let’s rise to the challenge.”

The Unsolved Problem of Statistics: The BIG Question

From a statistical analysis point of view, data can be classified along the following dimensions:

  • Data Type: discrete, continuous, and mixed (a combination of discrete and continuous data); see the sketch after this list.
  • Data Structure: univariate, multivariate, time series, spatial, image, graph/network, etc. (roughly in order of increasing complexity).
  • Data Pattern: linear/non-linear, stationary/non-stationary, etc.
  • Data Size: small, medium, and big (though this may be vague, as today’s big data are tomorrow’s small data).
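
Even the first item on this list, recognizing the data type of each variable, can be made programmatic. Here is a small, purely illustrative R helper (the function names are my own, hypothetical ones):

classify_column <- function(x, max_levels = 10) {
  # treat factors, strings, and logicals as discrete
  if (is.factor(x) || is.character(x) || is.logical(x)) return("discrete")
  if (is.numeric(x)) {
    # numeric columns with only a few distinct values are called discrete here
    if (length(unique(x)) <= max_levels) return("discrete")
    return("continuous")
  }
  "other"
}

profile_data <- function(df) {
  types <- vapply(df, classify_column, character(1))
  c(table(factor(types, levels = c("discrete", "continuous", "other"))),
    mixed = any(types == "discrete") && any(types == "continuous"))
}

profile_data(iris)   # 4 continuous measurements + 1 discrete label -> mixed

A real solution would of course have to go far beyond such column-level bookkeeping and handle structure and pattern as well, which is exactly the programmatic challenge raised at the end of this post.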

There is a long history of developing core statistical modeling principles that are valid for any combination of the above. A few examples include bivariate continuous regression (Francis Galton), multivariate discrete data (Karl Pearson, Udny Yule, Leo Goodman), mixed data (Thomas Bayes, Student, R.A. Fisher, Fix and Hodges), time series (Norbert Wiener, Box & Jenkins, Emanuel Parzen, John Tukey, David Brillinger), non-stationarity (Clive Granger, Robert Engle), and non-linearity (Grace Wahba, Cleveland).

To tackle these rich varieties of data, many cultures of statistical science have been developed over the last century, which can be broadly classified as (1) Parametric confirmatory; (2) Nonparametric exploratory and (3) Optimization-driven Algorithmic approaches.

United Statistical Algorithm. I claim that what we need is a breakthrough: a “Periodic Table of Data Science.” Developing new algorithms in an isolated manner will not be enough to justify “learning from data” as a proper scientific endeavor. We have to put some order (by understanding their internal statistical structure) into the current inventory of algorithms, which are mushrooming at a staggering rate these days. The underlying unity, “how they relate to each other,” will dictate what the Fundamental Principles of Data Science are. At a more practical level, this will enable data scientists to predict new algorithms in a systematic way rather than by trial and error.

Theory of Data Analysis: How can we develop such a consistent and unified framework of data analysis (the foundation of data science), one that would reveal the interconnectedness among the different branches of statistics? This remains one of the most vexing mysteries of modern Statistics. Developing such a theory (leading to a progressive unification of fundamental statistical learning tools) could have enormous implications for the theory and practice of data analysis.

 


An example of a modern data challenge: Is there any fundamental, universal modeling principle to tackle these large varieties of data? This is still an Unsolved Problem, and it is especially difficult to solve programmatically. Big Data experts have started to recognize that this is a really challenging “Research problem! Killing most CIO’s” and that “if there is any achilles heel it’s going to be this.” Check out this recent Talk [PDF, Video (from 1:32:00 – 1:38:00)] by Turing laureate Michael Stonebraker @ White House Office of Science & Technology Policy and MIT, March 3, 2014, and also this and this one. I conjecture that Statistics can play a crucial role in solving this BIG-Variety problem.