The importance of data audits when building AI
Posted on: Apr 09, 2022
Artificial intelligence can do a lot to improve business practices, but AI algorithms can also introduce new avenues of risk. For example, consider Zillow’s recent shutdown of Offers, the branch of the company dedicated to buying fixer-uppers, after its prediction models significantly overshot house values. When housing price data changed unpredictably, the group’s machine-learning models didn’t adapt quickly enough to account for the volatility, resulting in significant losses. This type of data mismatch, or “concept drift,” is exactly what goes unnoticed when data audits don’t get the care and respect they deserve.
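One common way to catch drift like this is to compare the distribution your model was trained on against the data it sees in production. The sketch below uses the Population Stability Index (PSI), a standard drift metric; it is a minimal, stdlib-only illustration on synthetic prices, not Zillow’s actual pipeline, and the 0.1/0.25 thresholds are industry rules of thumb rather than hard cutoffs.

```python
import math
import random

def psi(reference, current, bins=10):
    """Population Stability Index between two numeric samples.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant drift."""
    lo, hi = min(reference), max(reference)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[0], edges[-1] = float("-inf"), float("inf")  # catch out-of-range values

    def shares(sample):
        counts = [0] * bins
        for x in sample:
            for i in range(bins):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
        # smooth empty bins so the log term is always defined
        return [(c + 0.5) / (len(sample) + 0.5 * bins) for c in counts]

    ref_s, cur_s = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_s, cur_s))

# Synthetic example: the live market has shifted up and grown more volatile.
random.seed(0)
train_prices = [random.gauss(300_000, 40_000) for _ in range(5_000)]
live_prices = [random.gauss(360_000, 70_000) for _ in range(5_000)]
print(f"PSI: {psi(train_prices, live_prices):.3f}")  # well above 0.25: flag for review
```

Running a check like this on each incoming batch turns drift from a silent model failure into an alert a human can act on.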
Zillow’s failure to properly audit its data didn’t just hurt the company; it could have caused wider damage by scaring other businesses away from AI. Negative perceptions of a technology can halt its progress in the commercial world, especially for a category like AI that has already gone through several winters. Machine-learning pioneers like Andrew Ng recognize what hangs in the balance and have started campaigns to emphasize the importance of data audits, for example by holding an annual competition for the best data quality assurance methods (instead of picking winners based solely on model performance, as has traditionally been done).
Beyond my own work building AI, as host of The Robot Brains podcast I’ve interviewed dozens of AI practitioners and researchers about their approach to auditing and maintaining high-quality data. Here are some of the best practices I’ve compiled from that work:
Beware of outsourcing your data curation and labeling. Data maintenance isn’t the sexiest task, and it’s time-intensive. When time is short, as it is for most entrepreneurs, it’s tempting to outsource the responsibility. But beware of the risks that come with it. A third-party vendor won’t be as intimately familiar with your product vision, won’t know the contextual nuances, and won’t have the personal incentive to keep the close watch that’s required. Andrej Karpathy, head of AI at Tesla, says he spends 50% of his own time maintaining the vehicles’ data playbooks because it’s that important.
If your data is incomplete, address the gaps. All is not lost if your data sources reveal gaps or potential areas for erroneous prediction. One source that’s often problematic is demographic data. As we know, historical demographic data sources tend to skew towards white males, and that can bias your entire model. Princeton professor and AI4ALL co-founder Olga Russakovsky created the REVISE tool, which surfaces patterns of correlations (possibly spurious) in visual data. You can use its output to make your model insensitive to these patterns, or decide to collect more data that doesn’t exhibit them. (Here is the code to run the model if you want to use it.) Demographic data is the most commonly cited example (e.g., medical history data has traditionally contained a higher percentage of information about Caucasian males), but the same approach applies in any scenario.
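A first-pass gap audit can be as simple as comparing group shares in your dataset against a benchmark population and flagging anything badly underrepresented. The sketch below is a hypothetical illustration (the field names, records, and 50% tolerance threshold are my own choices, not from REVISE or any specific dataset):

```python
from collections import Counter

def audit_representation(records, field, benchmark, tolerance=0.5):
    """Flag groups whose share of `records` falls below `tolerance` times
    their share in a benchmark population. Returns {group: (observed, expected)}."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    flags = {}
    for group, expected in benchmark.items():
        observed = counts.get(group, 0) / total
        if observed < tolerance * expected:
            flags[group] = (observed, expected)
    return flags

# Hypothetical patient records and census-style benchmark shares.
records = [{"sex": "male"}] * 800 + [{"sex": "female"}] * 200
benchmark = {"male": 0.49, "female": 0.51}
print(audit_representation(records, "sex", benchmark))
# {'female': (0.2, 0.51)} -- 20% of records vs ~51% expected, so flagged
```

Flags like this don’t fix the bias by themselves, but they tell you where to collect more data or where to treat model predictions with extra skepticism.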
Understand the implications of sacrificing intelligence for speed. Your data audit may motivate you to plug in larger data sets with more complete coverage. In theory, that might seem like a great strategy, but it can actually be a mismatch for the business goal at hand. The larger the data set, the slower the analysis. Is that extra time justified by the value of the increased insight?
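One practical way to answer that question is to measure a learning curve: train on growing subsets and track both accuracy and wall-clock time. The toy experiment below uses a nearest-centroid classifier on synthetic data; it is only a sketch of the method, and real workloads will show different numbers, but the shape of the curve (accuracy plateaus while cost keeps growing) is what you’re looking for.

```python
import random
import time

random.seed(1)

def make_point(label):
    """5-dimensional Gaussian point; the two classes have offset centers."""
    center = 0.0 if label == 0 else 1.5
    return ([random.gauss(center, 1.0) for _ in range(5)], label)

train = [make_point(i % 2) for i in range(20_000)]
test = [make_point(i % 2) for i in range(2_000)]

def centroid(points):
    return [sum(p[d] for p in points) / len(points) for d in range(len(points[0]))]

def accuracy(subset):
    """Fit class centroids on `subset`, score on the held-out test set."""
    cents = {lbl: centroid([x for x, l in subset if l == lbl]) for lbl in (0, 1)}
    def predict(x):
        return min(cents, key=lambda l: sum((a - b) ** 2 for a, b in zip(x, cents[l])))
    return sum(predict(x) == l for x, l in test) / len(test)

for n in (200, 2_000, 20_000):
    start = time.perf_counter()
    acc = accuracy(train[:n])
    elapsed = time.perf_counter() - start
    print(f"n={n:>6}  accuracy={acc:.3f}  seconds={elapsed:.3f}")
```

If accuracy has flattened well before you reach the full dataset, the extra data (and the latency it costs you) may not be buying any business value.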