Five Reasons Why Organic Data Is Healthy For A Data Science Model

Posted on: Sep 15, 2021

Text data is one of the largest forms of unstructured data and is ever-growing. At Reorg, I work with large amounts of financial text data every day. One challenge of working with text data is that you need a large training data set to build robust models. You also need good, organic training data, which will be described in further detail in this article.

Machine learning (ML) models are only as good as the data used to train them. Over the years, I have collected training data for several supervised ML models, either from databases where the data was labeled as part of some business process or as new training data gathered from subject matter experts (SMEs), project managers and product managers. It is important to put in the effort and time to ensure your training data is organic, meaning it is rich, robust and reliable.

In this short article, I will share five things (which fall under the acronym "CARES") to care about when collecting training data for supervised ML models, particularly models built on text data.

1. Consistency In Subjectivity

In text-classification problems, it is common to encounter subjectivity conflicts: the same text can mean different things to different users. A common example in credit-related sentiment analysis is deciding whether language in an earnings call transcript conveys negative or positive sentiment.

When dealing with subjective data, it is important to have training examples that overlap across labelers so that labeling reliability can be checked, and it is equally important to ensure consistency in how subjective language is labeled. Maintaining consistency in the training data prevents multiple conflicting ground-truth values from coexisting for similar texts, which would introduce noise into an ML model and lead to underfitting.
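As a concrete illustration, here is a minimal sketch of how such a reliability check might look, assuming two hypothetical annotators have labeled the same overlapping set of sentences (the sentences, labels and annotator data below are invented for the example):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical overlap set: the same texts labeled independently by two annotators.
overlap = [
    # (text, annotator_1_label, annotator_2_label)
    ("Revenue guidance was lowered for the third straight quarter.", "negative", "negative"),
    ("Management expects margin pressure to persist into next year.", "negative", "neutral"),
    ("The company refinanced its notes at a lower coupon.", "positive", "positive"),
    ("Liquidity remains adequate, though covenant headroom is shrinking.", "neutral", "negative"),
]

labels_1 = [a1 for _, a1, _ in overlap]
labels_2 = [a2 for _, _, a2 in overlap]

# Inter-annotator agreement: low kappa signals inconsistent labeling of subjective language.
print("Cohen's kappa:", cohen_kappa_score(labels_1, labels_2))

# Flag texts whose labels conflict so they can be adjudicated before training;
# leaving both labels in place would put conflicting ground truths into the training set.
for text, a1, a2 in overlap:
    if a1 != a2:
        print(f"CONFLICT: {a1!r} vs {a2!r} -> {text}")
```

Conflicting examples can then be adjudicated, for instance by an SME, so that a single ground-truth label per text makes it into the training set.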

2. Avoid Bias

When my team and I start building a new supervised machine learning model to identify and classify a novel text dataset, we need to ask the SMEs for examples to train the model. The SMEs tend to find training examples through existing search bars or data queries, using keywords they are familiar with and that, to their knowledge, best define the documents of interest. The training data then inherits the pattern used to search or query for it, and that pattern, in turn, introduces bias into the supervised model training.

For example, when trying to identify merger and acquisition (M&A) related articles, an SME might supply training data by simply searching for keywords such as “merge” and “acquire” and handing us the results. However, those keywords do not exhaust the language used in pertinent M&A headlines. The resulting model will underfit: it learns to rely on the searched keywords and other strongly co-occurring terms, but it will not be as robust as a model trained on thoroughly randomized data.
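To make that effect concrete, here is a minimal sketch (with invented headlines and seed keywords) of one way to estimate how strongly a keyword-sourced training set over-represents its own seed terms compared with a random sample of the corpus:

```python
# Hypothetical seed keywords an SME might have used to pull training examples.
SEED_KEYWORDS = ("merg", "acquir")  # stems cover "merge", "merger", "acquire", "acquisition"

def contains_seed(headline: str) -> bool:
    """True if the headline contains any of the seed keyword stems."""
    text = headline.lower()
    return any(kw in text for kw in SEED_KEYWORDS)

def seed_coverage(headlines) -> float:
    """Fraction of headlines that contain at least one seed keyword."""
    if not headlines:
        return 0.0
    return sum(contains_seed(h) for h in headlines) / len(headlines)

# Invented examples: what a keyword search returns vs. a random corpus sample.
keyword_sourced = [
    "XYZ Corp to acquire ABC Holdings for $2B",
    "Shareholders approve merger of two regional lenders",
    "Acquisition talks stall over valuation gap",
]
random_sample = [
    "XYZ Corp to acquire ABC Holdings for $2B",
    "ABC Holdings explores strategic alternatives",   # M&A-relevant, no seed keyword
    "Board hires advisers to weigh takeover bid",     # M&A-relevant, no seed keyword
    "Q3 earnings beat estimates on cost cuts",
]

print("Seed-keyword coverage, keyword-sourced set:", seed_coverage(keyword_sourced))
print("Seed-keyword coverage, random sample:      ", seed_coverage(random_sample))
# A large gap suggests the training data has inherited the search pattern, so the
# model may learn the seed keywords rather than the broader language of M&A headlines.
```

In practice, supplementing the keyword-sourced examples with a randomly sampled, SME-labeled slice of the corpus helps surface relevant language that the seed keywords miss.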