Tag Archives: Next-Generation Statisticians

Confirmatory Culture: Time To Reform or Conform?

Confirmatory culture is deep-rooted within Statistics.


Culture 1: Algorithm + Theory: the role of theory is to justify or confirm.

Culture 2: Theory + Algorithm: From confirmatory to constructive theory, explaining the statistical origin of the algorithm(s)–an explanation of where they came from. Culture 2 views “Algorithms” as the derived product, not the fundamental starting point [this point of view separates statistical science from machine learning].


Culture 1: Science + Data: The job of a statistician is to confirm scientific guesses, and thus to play happily in everyone’s backyard as a confirmatist.

Culture 2: Data + Science: An exploratory, nonparametric attitude. The statistician plays in the front yard as a key player, guiding scientists to ask the “right question”.


Culture 1: It proceeds in the following sequence:

for (i in 1:B) {
  Teach Algorithm-i;
  Teach Inference-i;
  Teach Computation-i
}

By construction, it requires extensive bookkeeping and memorization of a long list of disconnected algorithms.

Culture 2: The pedagogical efforts emphasize the underlying fundamental principles and statistical logic whose consequences are algorithms. This “short-cut” approach substantially accelerates the learning by making it less mechanical and intimidating.

Should we continue to conform to the confirmatory culture, or is it time to reform? The choice is ours, and the consequences are ours as well.

The Scientific Core of Data Analysis

My observation is motivated by Richard Courant’s view:

However, the difficulty that challenges the inventive skill of the applied mathematician is to find suitable coordinate functions.

He also noted that

If these functions are chosen without proper regard for the individuality of the problem the task of computation will become hopeless.

This leads me to the following conjecture: an efficient nonparametric data transformation or representation scheme is the basis for almost all successful learning algorithms. It is the Scientific Core of Data Analysis, and it should be emphasized in the research, teaching, and practice of 21st-century Statistical Science in order to develop a systematic and unified theory of data analysis (the foundation of data science).
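As a small illustration of what data-adapted “coordinate functions” might look like, here is a minimal sketch. It is not taken from the post: the mid-rank transform, the orthonormal shifted Legendre basis, and the names `mid_rank_transform`, `legendre_scores`, and `represent` are all assumptions chosen for the example.

```python
import math


def mid_rank_transform(x):
    """Map each observation to its mid-rank divided by n (values strictly inside (0, 1))."""
    n = len(x)
    out = []
    for v in x:
        below = sum(1 for s in x if s < v)
        ties = sum(1 for s in x if s == v)
        out.append((below + 0.5 * ties) / n)
    return out


def legendre_scores(u):
    """First three orthonormal shifted Legendre polynomials on [0, 1], evaluated at u."""
    return (
        math.sqrt(3) * (2 * u - 1),
        math.sqrt(5) * (6 * u * u - 6 * u + 1),
        math.sqrt(7) * (20 * u ** 3 - 30 * u * u + 12 * u - 1),
    )


def represent(x):
    """Hypothetical representation: rank-transform the data, then apply the basis."""
    return [legendre_scores(u) for u in mid_rank_transform(x)]


# Illustrative data (made up): a sample and a strictly increasing transformation of it.
x = [0.3, -1.2, 2.5, 0.9, -0.4, 1.7]
y = [math.exp(v) for v in x]

# The rank-based coordinates depend only on the ordering of the observations,
# so they are unchanged by the monotone transformation.
same_representation = represent(x) == represent(y)
print(same_representation)
```

The design choice being illustrated is that a rank-based representation needs no guess about the scale or distribution of the raw data, which is one possible reading of “nonparametric” here.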

Two Kinds of Mathematical Statisticians: Connectionist and Confirmatist

In the field of Statistical research, Mathematicians or Theoreticians come in two very distinct flavors:

Connectionist: Mathematicians who invent and connect novel algorithms based on new fundamental ideas that address real data modeling problems.

Confirmatist: Mathematicians who prove why an existing algorithm works under certain sets of assumptions/conditions (post-mortem report).

However, theoreticians of the first kind (a few examples: Karl Pearson, Jerzy Neyman, Harold Hotelling, Charles Stein, Emanuel Parzen, Clive Granger) are much rarer than those of the second. The current culture has failed to distinguish between these two types (which are very different in style and motivation) and has placed excessive importance on the second kind. This has created an imbalance and often gives a wrong impression of what “Theory” means. We need to discover new theoretical tools that not only prove why already invented algorithms work (a confirmatory check) but also provide insight into how to invent and connect novel algorithms for effective data analysis: 21st-century statistics.

Impact: The way I see it

Quantifying impact is a difficult task. However, to me, it is governed by a simple equation:

Theoretical beauty × Practical utility = Impact of your work.

  • By Theoretical Beauty, I mean the capacity of a concept or idea to unify (not proving consistency or a rate of convergence).
  • Practical utility denotes the generic usefulness of the algorithm, that is, being applicable to many problems at once: wholesale algorithms (not just writing R packages and coding).
  • The goal is to ensure that neither quantity on the left-hand side of the equation is close to zero. A perfect balance is required to maximize the impact (which is an art).

Models of Convenience to Useful Models

I enjoyed reading this article by Mark van der Laan, which has a number of noteworthy aspects. I feel it’s an excellent just-in-time reminder, which rightly demands a change in perspective: “We have to start respecting, celebrating, and teaching important theoretical statistical contributions that precisely define the identity of our field.” The real question is: which are those topics?

Answer: which statistical concepts and tools are routinely used by non-statistician data scientists for their data-driven discovery? How many of them were discovered in the last three decades (compare this with the number of so-called “top journal” papers published every month!)? Are we moving in the right direction? Isn’t it obvious why “our field has been nearly invisible in key arenas, especially in the ongoing discourse on Big Data and data science” (Davidian 2013)? Selling the same thing under a new name is not going to help (in either research or teaching); we need to invent and recognize new ideas that are both beautiful and useful.

I totally agree with what he said: “Historically, data analysis was the job of a statistician, but, due to the lack of rigor that has developed in our field, I fear our representation in data science is becoming marginalized.” I believe the first step is to go beyond the currently fashionable plug-and-play attitude to model building. Let’s make it an Interactive and Iterative (and thus more enjoyable) process based on a few fundamental and unified rules. Another way of saying the same thing is, “the smartest thing on the planet is neither man nor machine – it’s the combination of the two” [George Lee].

He refers to the famous quote “All models are wrong, but some are useful.” He also expressed the concern that “Due to this, models that are so unrealistic that they are indexed by a finite dimensional parameter are still the status quo, even though everybody agrees they are known to be false.”

To me the important question is: can we systematically discover the useful ones, rather than starting with a guess based solely on convenience? Convenience typically comes in two types: theoretical and computational. (Classical) theoreticians like to stay in the perpetual fantasy world of “optimality,” whereas the (present-day) computational goal is to make things “faster” by hook or by crook.

It seems to me that the ultimate goal is to devise a “Nonparametric procedure to Discover Parametric models” (The Principle of NDP), which are simple and better than “models of convenience.” Do we have any systematic modeling strategy for that? [An example]
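One possible reading of this idea can be sketched in a few lines, in the spirit of Neyman’s smooth goodness-of-fit test: start from a candidate parametric model, transform the data by its distribution function, and let a few orthonormal Legendre coefficients say whether the candidate can stand or in which direction it needs correcting. This is a hedged illustration, not the author’s procedure; the Exponential(1) candidate, the quantile-based samples, the 2/sqrt(n) cutoff, and all function names are assumptions.

```python
import math


def legendre_scores(u):
    """First three orthonormal shifted Legendre polynomials on [0, 1], evaluated at u."""
    return (
        math.sqrt(3) * (2 * u - 1),
        math.sqrt(5) * (6 * u * u - 6 * u + 1),
        math.sqrt(7) * (20 * u ** 3 - 30 * u * u + 12 * u - 1),
    )


def lp_coefficients(x, cdf0):
    """Average each Legendre score over u_i = cdf0(x_i); all are near 0 if cdf0 fits."""
    totals = [0.0, 0.0, 0.0]
    for v in x:
        for j, score in enumerate(legendre_scores(cdf0(v))):
            totals[j] += score
    return [t / len(x) for t in totals]


def flagged_terms(coefs, n, z=2.0):
    """1-based indices of coefficients exceeding the rough cutoff z / sqrt(n)."""
    cutoff = z / math.sqrt(n)
    return [j + 1 for j, c in enumerate(coefs) if abs(c) > cutoff]


def exp1_cdf(v):
    """Candidate parametric model (an assumption): Exponential with rate 1."""
    return 1.0 - math.exp(-v)


# Deterministic illustrative samples built from quantiles, so the run is reproducible.
n = 200
grid = [(i + 0.5) / n for i in range(n)]
x_good = [-math.log(1.0 - p) for p in grid]        # Exponential(rate=1) quantiles
x_bad = [-math.log(1.0 - p) / 2.0 for p in grid]   # Exponential(rate=2) quantiles

coefs_good = lp_coefficients(x_good, exp1_cdf)
coefs_bad = lp_coefficients(x_bad, exp1_cdf)
good_flags = flagged_terms(coefs_good, n)
bad_flags = flagged_terms(coefs_bad, n)
print(good_flags, bad_flags)
```

In this sketch the first sample raises no flags, so the simple parametric candidate is retained, while the second flags only the first (location-type) term, which points to how the candidate should be repaired rather than merely rejecting it.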


“Stop working on toy problems, stop talking down theory, stop being attached to outdated statistical methods, stop worrying about the politics of our journals and our field. Be a true and proud statistician who is making an impact on the real world of Big Data. The world of data science needs us: let’s rise to the challenge.”

Time to change the rules of the game

This recent Nature article reminds us of the importance of the Nonparametric Exploratory Modeling (Data + Science, NOT Science + Data) attitude, in which scientists ask questions about the validity of the statistical findings. In other words, as the article suggests, “The numbers are where the scientific discussion should start, not end.”

But unfortunately, “The basic framework of statistics has been virtually unchanged since Fisher, Neyman and Pearson introduced it.” Isn’t it high time “to change how statistics is taught, how data analysis is done and how results are reported and interpreted”?

Data Then Science, Next-Generation Statisticians

Data Science: The word “Data” precedes the word “Science”, which signifies a dramatic change in modeling attitude in the 21st century.

  • Statistical validation of scientific guesses → scientific validation of Statistical findings.
  • Parametric Confirmatory (Science + Data) → Nonparametric exploratory modeling (Data + Science).

Question: How many such nonparametric exploratory modeling tools (not inferential tools!) have we developed in the last three decades?