Data Lakes, Ponds, and Puddles


Using the cute little graphic above, Amazon Web Services defines a data lake as: “… a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.”

The idea has undeniable appeal. Machine learning algorithms are revolutionizing our world, and the fuel for this engine is data. The promise is that with sufficient computing power and enough data, you can forgo constructing statistical models. Throw all of the data into a pot (or a lake) and let the computer figure out what is meaningful.

In “The Unreasonable Effectiveness of Data,” the authors, top Google researchers, emphasize that with terabytes of data, feeble algorithms suddenly become powerful. “With a corpus of thousands of photos, the results were poor. But once they accumulated millions of photos, the same algorithm performed quite well.” Some have even argued that we are on the verge of a paradigm shift, with data-driven discovery replacing the hypothesis-driven scientific method. While I’m not yet ready to pronounce the scientific method dead, the power of data analytics is undeniable.

Higher education administration is late to the party. Quoting Big Data on Campus, “Despite some newfound emphasis on data analytics, education officials are not yet adept at using analytics to support institutional decision making.” As competition intensifies around a declining number of college students, universities that fail to embrace data-informed decision making will struggle to survive.

I would argue that the principal obstacle to developing advanced analytics at Temple University is our data structure. We do not have a data lake; we have a few ponds and a lot of puddles. We are far from the utopia described in Big Data on Campus: “a mutually supportive environment where data users with analytic needs and appropriate security clearance can connect to all available data resources across different organization vectors to detect patterns or connections that a single data silo will not help.”

Recently, I have seen signs of real progress at Temple. Visualization tools are being discussed in meetings, and even AI tools are making an appearance. Different units across the university are talking about how to share their data. Data-informed decision making (a fusion of algorithms, data, and human insight) is coming to Temple. Slowly, but it IS coming.
