Training Data: Why Scale Is Critical for Your AI Future

Posted on: Feb 22, 2019

Data is the fuel that drives AI. But there’s a big difference in the quality of fuel you can put into your AI engine. If your enterprise can create the biggest stockpile of the highest quality training data, it will likely win the AI race, but getting there is no easy task.

For all the advanced skills that data scientists possess, there’s no escaping the fact that they often spend up to 80% of their time cleaning and prepping data. Without good, clean data to feed into machine learning algorithms, the data scientist can’t be sure that the model will predict anything worthwhile.

And so data scientists spend much of their time doing what amounts to data janitorial work. While data scientists will always want to personally inspect some of the data that’s being used to train machine learning models, there clearly are better uses of data scientists’ time.

This situation has spawned a cottage industry of data labeling outfits. You provide the raw data and a description of what you’re after, and the data labeling company will distribute the work to its human workforce, who will apply the labels that will tell algorithms what to look for, leaving data scientists to concentrate on improving the model.

One of the data labeling outfits helping American enterprises stockpile high-quality AI fuel is Alegion. Nathaniel Gates founded the Austin, Texas company seven years ago as a crowdsourcing company for general-purpose business automation tasks. But about three years ago, Gates noticed a change in the types of tasks customers were requesting.

“We didn’t really know quite what we were doing at that point,” Gates tells Datanami. “They needed 97% to 99% accurate data, and they needed it very large scale. You had to knock us on the head a few times for us to realize, oh holy cow, this is not going away. And we ended up rebuilding the platform, our whole technology stack, around very specifically the construction and development of very high quality training data.”

Garbage In, Garbage Out

The potential for bad data to negatively impact AI projects is not a theoretical threat. In fact, bad data threatens the very existence of AI, at least as we have come to define it. That’s why data scientists spend so much of their time making absolutely certain the data they’re feeding into their machine learning algorithms is as good as it can be.

Alegion finds itself working closely with data science teams to obtain training data. “We’re very much on the nose of the pain for these data science teams, because it’s still a garbage in, garbage out world,” Gates says.