Integrating Science With Data For Reliable Machine Learning Models

Posted on: Jul 13, 2020

In my data analytics and machine learning (ML) consulting engagements, I often come across use cases aimed at solving scientific problems using data, such as predicting the failure of a turbine or forecasting the carbon footprint of an IT data center. But what exactly is a scientific problem, and how is it different from a data problem? Is it really necessary to validate a known scientific fact or model again with data? Before answering these questions, let's define some key terms.

First, data is a record of a phenomenon and is inherently historical. A set of data can be used to form a hypothesis, which is a possible explanation of the phenomenon. An experiment is a procedure carried out to validate the hypothesis, and it can yield new evidence or new data that either confirms or contradicts it. When the hypothesis repeatedly holds true, it is accepted as a scientific law. In the end, a scientific law is a statement that summarizes the results of hypotheses and experiments under some standard operating conditions.

Against this backdrop, a scientific problem depends on standard operating conditions, which are represented in the model as explanatory variables. A scientific problem is comparatively easy to solve because the underlying model is already proven. A data problem, on the other hand, typically involves determining the unknown from data, through hypotheses and experiments.

So, can we solve scientific problems using data? Let's say there is a hypothetical prediction (regression) model that predicts the failure of an electric generator based on three independent variables: load, vibration and coolant. These three independent variables are selected based on scientific laws or known insights under standard operating conditions. However, there could be a fourth independent variable that needs to be factored into the scientific model because the generator is now operating in a new environment and not in the standard operating conditions.

So, when regression analysis is performed using data from only the three known independent variables, the regression or prediction model might fail because a significant variable is not factored into the prediction. The reverse also holds: If the data being collected is not relevant to predicting the failure of the generator, it is better to stop collecting it, because data collection is a very expensive process in most organizations.
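The effect of a missing variable can be illustrated with a small sketch on synthetic data. Everything here is an assumption made for illustration: the data is randomly generated, the fourth variable is called `ambient` (a stand-in for whatever the new environment introduces), `paint` is an invented irrelevant measurement, and the coefficients of the "true" relationship are arbitrary. The target is a continuous failure-risk score fitted with ordinary least squares, which is only one possible reading of the generator example.

```python
import numpy as np


def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return R^2."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return 1.0 - residuals.var() / y.var()


rng = np.random.default_rng(0)
n = 2000

# The three known explanatory variables (synthetic, standardized).
load, vibration, coolant = rng.normal(size=(3, n))
# Hypothetical fourth variable that matters in the new environment.
ambient = rng.normal(size=n)
# Hypothetical measurement that has no effect on failure risk.
paint = rng.normal(size=n)

# Assumed ground truth for the failure-risk score (arbitrary coefficients).
risk = (1.0 * load + 0.8 * vibration - 0.5 * coolant
        + 1.5 * ambient + rng.normal(scale=0.3, size=n))

known = np.column_stack([load, vibration, coolant])
r2_known = r_squared(known, risk)
r2_with_ambient = r_squared(np.column_stack([known, ambient]), risk)
r2_with_irrelevant = r_squared(np.column_stack([known, paint]), risk)

print(f"three known variables:      R^2 = {r2_known:.3f}")
print(f"plus the missing variable:  R^2 = {r2_with_ambient:.3f}")
print(f"plus an irrelevant one:     R^2 = {r2_with_irrelevant:.3f}")
```

Under these assumptions, the model built on the three known variables explains only part of the variation in risk, adding the missing variable recovers most of the rest, and the irrelevant measurement changes the fit very little. The last case is the one where collecting the data is hard to justify.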

This problem is not just applicable to the engineering domain. Pharma businesses have been exploring data to define patterns within sets of health information, clinical trial results and research projects to augment their existing scientific knowledge in the life sciences domain. The key point here is that science alone is often not sufficient to develop reliable data analytics and ML models. Scientific models need to be augmented and continuously validated with data, especially when the conditions are not standard or when there is uncertainty.