I enjoyed reading this article by Mark van der Laan, which has a number of noteworthy aspects. I feel it’s an excellent just-in-time reminder, which rightly demands a change in perspective: “We have to start respecting, celebrating, and teaching important theoretical statistical contributions that precisely define the identity of our field.” The real question is which are those topics?
Answer: which statistical concepts and tools are routinely used by non-statistician data scientists for their data-driven discovery? How many of them were discovered in the last three decades (and compare with the number of so-called “top journal” papers that get published every month!)? Are we moving in the right direction? Isn’t it obvious why “our field has been nearly invisible in key arenas, especially in the ongoing discourse on Big Data and data science.” (Davidian 2013). Selling the same thing under a new name will not going to help (in either research or teaching) ; we need to invent and recognize new ideas, which are beautiful & useful.
I totally agree with what he said, “Historically, data analysis was the job of a statistician, but, due to the lack of rigor that has developed in our field, I fear our representation in data science is becoming marginalized.” I believe the first step is to go beyond the currently fashionable plug-and-play type model building attitude – let’s make it an Interactive and Iterative (thus more enjoyable) process based on few fundamental and unified rules. Another way of saying the same thing is, “the smartest thing on the planet is neither man nor machine – its the combination of the two” [George Lee].
He refers to the famous quote “All models are wrong, but some are useful.” He also expressed the concern that “Due to this, models that are so unrealistic that they are indexed by a finite dimensional parameter are still the status quo, even though everybody agrees they are known to be false.”
To me the important question is: Can we systematically discover the useful ones rather than starting with a guess solely based on convenience–typically two types: Theoretical and Computational. (Classical) Theoreticians like to stay in the perpetual fantasy world of “optimality,” whereas the (present-day) Computational goal is to make it “faster” by hook or crook.
It seems to me that the ultimate goal is to devise a “Nonparametric procedure to Discover Parametric models” (The Principle of NDP), which are simple and better than “models of convenience.” Do we have any systematic modeling strategy for that? [An example]
“Stop working on toy problems, stop talking down theory, stop being attached to outdated statistical methods, stop worrying about the politics of our journals and our field. Be a true and proud statistician who is making an impact on the real world of Big Data. The world of data science needs us—let’s rise to the challenge.”