Data Scientist and Data Mechanic

Beware of “Kaggle Syndrome”. Refuse to jump on the bandwagon of brainlessly combining random algorithms to win a competition. In any case, this will make NO impact (other than 15 minutes of fame), as happened with the Netflix competition.

As datasets get BIG and COMPLEX, the most difficult challenge for a Statistical Scientist is to figure out “Where is the information hidden?” It’s an interactive process of investigation rather than a passive application of algorithms and calculation of error rates. Two skills are critical: (1) “look at the data”, which is missing in the mechanical push-the-button culture; and (2) learn “how to question the data”, rather than only answering a specific question. Together they allow data scientists to discover the unexpected in addition to the usual verification of the expected.

This raises the question of whether

  • the Data Science training curriculum should look like a long manual of specialized methods and (a series of cookbook) algorithms;
  • or should train students (and industry professionals) in Scientific Data Exploration (Sci-Dx): a systematic and pragmatic approach to data modeling that addresses the “Monkey and banana problem” [Pigeon’s approach] for practitioners. [I believe Wolfgang Köhler’s “insight learning” idea can guide us in developing such a curriculum.]

The first path will produce DataRobots, not Data Scientists. The latter goal looks out of reach unless we figure out how to design the “LEGO Bricks” of Statistical Science (the fundamental building blocks of Statistical learning), which help us understand disparate Statistical procedures from a common perspective (thus reducing the size of the manual) and can be appropriately combined to build versatile data products brick by brick.