Do We Need More Data Or More Science In Data Science?

Posted on: Feb 20, 2020

Is the success of Google that of the algorithms or that of data?

Today’s fascination with artificial intelligence (AI) reflects both our appetite for data and our excitement about the new opportunities in machine learning. Here, I argue that newcomers to the field of data science are blinded by the shiny object of magical algorithms, and that they forget the critical infrastructure needed to create and manage data in the first place. Many companies now provide AI services. To evaluate these commercial offers, it is useful to work through the following questions:

• Do they offer expertise in AI (e.g., deep learning)?

• Do they offer the generation of data (e.g., high throughput images for drug action)?

• Do they offer access to data (e.g., privileged access to medical data)?

• Can they build the infrastructure to manage data (e.g., cloud and computing services)?

An attractive offer should answer yes to all of the above. Expertise in analysis and algorithms alone is generally insufficient, as it does not necessarily address the data part of the equation.

Build With Purpose

Data management and infrastructure are the ugly duckling of data science. Yet they are the precondition for a successful program and therefore need to be built with purpose. This requires careful consideration of strategies for data capture, storage of raw and processed data, and tools for retrieval. Data can be structured (e.g., names, dates, addresses) or unstructured (e.g., text, video, audio, imagery), but it should always be collected under the principle that data is an asset. Why an asset? Because it can have intrinsic value beyond the purpose for which it was originally collected.

There are risks lurking in building infrastructure: underengineering and overengineering, underautomating and overautomating. A team can spend two months of engineering time to save perhaps two days of user time per year. Being sensible about when to automate can save hundreds of thousands of dollars. There is also a long-standing concern that data will accumulate, say from manufacturing processes or clinical drug trials, but that only a fraction will eventually be accessed. This concern may reflect historical practice more than the contemporary understanding of the intrinsic value of data.
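
The automation trade-off described above can be made concrete with a quick break-even calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes that two months of engineering is roughly 40 working days and that an engineer's day and a user's day cost the same. It also ignores maintenance of the automation itself.

```python
def automation_payback_years(build_days: float, days_saved_per_year: float) -> float:
    """Years of use needed before time saved equals time spent building."""
    if days_saved_per_year <= 0:
        raise ValueError("automation must save some time per year")
    return build_days / days_saved_per_year


# Assumption: two months of engineering time is about 40 working days.
BUILD_DAYS = 40
# From the example above: about two days of user time saved per year.
DAYS_SAVED_PER_YEAR = 2

payback = automation_payback_years(BUILD_DAYS, DAYS_SAVED_PER_YEAR)
print(f"Break-even after {payback:.0f} years")
```

Under these assumptions the automation would need to run unchanged for about two decades before it pays for itself, which is the kind of estimate worth making before committing engineering time.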