
Two sides of Theoretical Data Science: Analysis and Synthesis

This discussion is motivated by a simple question (that was posed to me by a theoretical computer scientist): “What is the role of Statistics in the development of Theoretical Data Science?” The answer lies in understanding the big picture.

Theory of [Efficient] Computing: A branch of Theoretical Computer Science that deals with how quickly a given problem can be solved (computed). The critical task is to analyze algorithms carefully, based on their performance characteristics, in order to make them computationally efficient.

Theory of Unified Algorithms: An emerging branch of Theoretical Statistics that deals with how efficiently one can represent a large class of diverse algorithms using a single unified semantics. The critical task is to put together different “mini-algorithms” into a coherent master algorithm.
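As a toy illustration of "single unified semantics" (an analogy only, using base R's cor function, not anything from the theory itself): one interface, one call pattern, several distinct mini-algorithms behind it.

x <- rnorm(100); y <- x + rnorm(100)   # simulated data for the illustration
cor(x, y, method = "pearson")          # moment-based (linear) correlation
cor(x, y, method = "spearman")         # rank-based correlation
cor(x, y, method = "kendall")          # concordance-based correlation

A unified theory of algorithms aims for this kind of economy at the level of statistical methods themselves, not merely at the level of software interfaces.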

For the overall development of Data Science, we need both ANALYSIS + SYNTHESIS. However, it is also important to bear in mind the distinction between the two.


The “Science” and “Management” of Data Analysis

Hierarchy and branches of Statistical Science

The phrases "Science" and "Management" of data analysis were introduced by Manny Parzen (2001) while discussing Leo Breiman's paper "Statistical Modeling: The Two Cultures," where he pointed out:

Management seeks profit, practical answers (predictions) useful for decision making in the short run. Science seeks truth, fundamental knowledge about nature which provides understanding and control in the long run.

Management = Algorithm: prediction and inference are undoubtedly the most useful and "sexy" parts of Statistics. Over the past two decades, tremendous advances have been made on this front, leading to a growing body of literature and excellent textbooks such as Hastie, Tibshirani, and Friedman (2009) and, more recently, Efron and Hastie (2016).

Nevertheless, we can surely all agree that algorithms do not arise in a vacuum, and our job as statistical scientists should amount to more than finding yet another "gut" algorithm. It has long been observed that elegant statistical learning methods can often be derived from something more fundamental. This forces us to think about the guiding principles for designing (wholesale) algorithms. The "Science" of data analysis = Algorithm discovery engine (Algorithm of Algorithms). Finding such a consistent framework of Statistical Science (from which one might be able to systematically derive a wide range of working algorithms) promises to be anything but trivial.


Above all, I strongly believe the time has come to switch our focus from "management" to the heart of the matter: how can we create an inclusive and coherent framework of data analysis (to accelerate the innovation of new versatile algorithms)–"A place for everything, and everything in its place"–encoding the fundamental laws of numbers? In this (difficult yet rewarding) journey, we have to constantly remind ourselves of the enlightening advice from Murray Gell-Mann (2005):

We have to get rid of the idea that careful study of a problem in some NARROW range of issues is the only kind of work to be taken seriously, while INTEGRATIVE thinking is relegated to cocktail party conversation.


Confirmatory Culture: Time To Reform or Conform?

The confirmatory culture is deeply rooted within Statistics.

THEORY

Culture 1: Algorithm + Theory: the role of theory is to justify or confirm.

Culture 2: Theory + Algorithm: From confirmatory to constructive theory, explaining the statistical origin of the algorithm(s)–where they come from. Culture 2 views "algorithms" as the derived product, not the fundamental starting point [this point of view separates statistical science from machine learning].

PRACTICE 

Culture 1: Science + Data: The job of a statistician is to confirm scientific guesses; thus he or she happily plays in everyone's backyard as a confirmatist.

Culture 2: Data + Science: An exploratory, nonparametric attitude. The statistician plays in the front yard as a key player, guiding scientists to ask the "right question."

TEACHING 

Culture 1: Teaching proceeds in the following sequence:

for (i in 1:B) {
  Teach Algorithm-i;
  Teach Inference-i;
  Teach Computation-i
}

By construction, it requires extensive bookkeeping and memorization of a long list of disconnected algorithms.

Culture 2: The pedagogical efforts emphasize the underlying fundamental principles and statistical logic whose consequences are algorithms (a pseudocode contrast is sketched below). This "short-cut" approach substantially accelerates learning by making it less mechanical and intimidating.
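In the same pseudocode spirit (a sketch of my own, not a prescription), the Culture 2 sequence would look more like:

Teach Fundamental-Principles;        # taught once, up front
for (i in 1:B) {
  Derive Algorithm-i;                # algorithms fall out as consequences
  Inference-i and Computation-i      # follow from the same principles
}

The loop no longer stores B disconnected recipes; it reuses one body of principles B times.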

Should we continue to conform to the confirmatory culture, or is it time to reform? The choice is ours, and the consequences are ours as well.

Models of Convenience to Useful Models

I enjoyed reading this article by Mark van der Laan, which has a number of noteworthy aspects. I feel it is an excellent just-in-time reminder that rightly demands a change in perspective: "We have to start respecting, celebrating, and teaching important theoretical statistical contributions that precisely define the identity of our field." The real question is: which are those topics?

Answer: those statistical concepts and tools that are routinely used by non-statistician data scientists for their data-driven discovery. How many of them were discovered in the last three decades (and compare that with the number of so-called "top journal" papers that get published every month!)? Are we moving in the right direction? Isn't it obvious why "our field has been nearly invisible in key arenas, especially in the ongoing discourse on Big Data and data science" (Davidian 2013)? Selling the same thing under a new name is not going to help (in either research or teaching); we need to invent and recognize new ideas that are beautiful and useful.

I totally agree with what he said: "Historically, data analysis was the job of a statistician, but, due to the lack of rigor that has developed in our field, I fear our representation in data science is becoming marginalized." I believe the first step is to go beyond the currently fashionable plug-and-play style of model building – let's make it an Interactive and Iterative (thus more enjoyable) process based on a few fundamental and unified rules. Another way of saying the same thing: "the smartest thing on the planet is neither man nor machine – it's the combination of the two" [George Lee].

He refers to the famous quote, "All models are wrong, but some are useful." He also expresses the concern that, "Due to this, models that are so unrealistic that they are indexed by a finite dimensional parameter are still the status quo, even though everybody agrees they are known to be false."

To me the important question is: can we systematically discover the useful models, rather than starting with a guess based solely on convenience? Convenience typically comes in two flavors: theoretical and computational. (Classical) theoreticians like to stay in the perpetual fantasy world of "optimality," whereas the (present-day) computational goal is to make things "faster" by hook or by crook.

It seems to me that the ultimate goal is to devise a "Nonparametric procedure to Discover Parametric models" (the NDP principle) – models that are simple and better than "models of convenience." Do we have any systematic modeling strategy for that? [An example]
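A toy sketch of the spirit only (not the NDP methodology itself; the simulated gamma data and every step below are purely illustrative): let a nonparametric estimate look at the data first, and adopt a parametric model only if it reproduces what the nonparametric estimate sees.

set.seed(1)
x <- rgamma(200, shape = 2, rate = 1)          # stand-in for real data

d_np <- density(x)                             # step 1: nonparametric density estimate
fit  <- MASS::fitdistr(x, "gamma")             # step 2: candidate parametric model

# step 3: does the simple model reproduce the nonparametric picture?
plot(d_np, main = "Nonparametric estimate vs. fitted gamma")
curve(dgamma(z, fit$estimate["shape"], fit$estimate["rate"]),
      xname = "z", add = TRUE, lty = 2)
# rough adequacy check (p-value only approximate: parameters were estimated from x)
ks.test(x, "pgamma", fit$estimate["shape"], fit$estimate["rate"])

If the dashed parametric curve tracks the nonparametric one, the simpler parametric model has earned its keep; if not, the nonparametric estimate has already shown us where it fails.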


"Stop working on toy problems, stop talking down theory, stop being attached to outdated statistical methods, stop worrying about the politics of our journals and our field. Be a true and proud statistician who is making an impact on the real world of Big Data. The world of data science needs us—let's rise to the challenge."

Beauty and Truth

Murray Gell-Mann, discussing why elegant equations are more likely to be right than inelegant ones:

What is Beauty? When, in terms of some mathematical notation, you can write the theory in a very brief space, without a lot of complication, that's essentially what we mean by beauty or elegance.

What is the role of Unification? We believe there is a unified theory underlying all the regularities. Steps toward unification exhibit simplicity and self-similarity across the scales. Therefore, the math for one skin (of the onion) allows you to express beautifully and simply the phenomenon of the next skin.

"You don't need something more to explain something more."

Answer Machine vs. Single Answer

Professor Manjul Bhargava, in a recent interview, mentioned: "Mathematics is about coming up with your own creative ways to come to that one right answer. There's not one path and everybody has their personal path that they can discover and that's what makes it fun. That's the adventurous part of mathematics, the creative part of mathematics."

From that perspective, I feel Statistics is MORE fun and adventurous because there is no "one answer." Develop your own way to arrive at your own solution that fits the data [Art of Statistics]. The real question to me is: can we derive these different levels of answers in a systematic way? [Science of Statistics].

We need to popularize the idea of  “Answer Machine” recommended by Manny Parzen (more than a decade back) where he mentioned “I believe that the concept of “answer machine” is required to explain what statisticians do. Mathematics finds a definite answer to a problem; statistics provide answer machines, which are formulas that can be adapted to compute and compare answers under varying assumptions”