Speaker "Jonas Mueller" Details

 

Topic

Operationalizing data-centric AI: Practical methods to quickly improve ML datasets

Abstract

In applied ML projects, experienced data scientists know that improving data brings higher ROI than tinkering with models. However, the process of finding and fixing problems in a dataset is highly manual (ad hoc ideas explored in Jupyter notebooks). Cleanlab develops open-source software to make this process more efficient (via novel algorithms that automatically detect certain issues in data) and more systematic (with better coverage of different types of issues). This talk will describe how high-level ideas from data-centric AI can be operationalized across a wide variety of datasets (image, text, tabular, etc.). I will introduce novel algorithmic strategies for automatically identifying various issues in data, which we have researched and published with extensive benchmarks. These include detection of label errors, bad data annotators, out-of-distribution examples, and other dataset problems that, once identified, can be easily addressed to significantly improve trained models. Thousands of data scientists have started using this sort of data-centric AI software, and results from a few case studies will be presented.

Who is this presentation for?
Data Scientists and ML Engineers working on real-world problems

Prerequisite knowledge:
The intended audience is folks with experience in supervised learning who want to develop the most effective ML for messy, real-world applications. Some of the content will be technical, but it will not require a deep understanding of how particular ML algorithms/models work (having completed one previous ML course/project should suffice). The topics should be of interest to anybody working in computer vision, natural language processing, audio/speech, tabular data, or other standard supervised learning applications, as well as to DataOps folks.

What will you learn?
How best to practice data-centric AI in real-world ML projects. This covers automated methods to check a dataset for various issues common in ML data, as well as how to efficiently address those issues to improve the dataset and the subsequent ML model. I will cover novel algorithms invented by our research team and case studies that showcase the benefits of data-centric AI in real-world ML applications.

Profile

Jonas Mueller is Chief Scientist and Co-Founder at Cleanlab, a software company providing data-centric AI tools to efficiently improve ML datasets. Previously, he was a senior scientist at Amazon Web Services developing AutoML and Deep Learning algorithms which now power ML applications at hundreds of the world's largest companies. In 2018, he completed his PhD in Machine Learning at MIT, also doing research in NLP, Statistics, and Computational Biology. Jonas has published over 30 papers in top ML and Data Science venues (NeurIPS, ICML, ICLR, AAAI, JASA, Annals of Statistics, etc.). This research has been featured in Wired, VentureBeat, Technology Review, World Economic Forum, and other media. He loves contributing to open source, and helped create the fastest-growing open-source software for AutoML (https://github.com/awslabs/autogluon) and Data-Centric AI (https://github.com/cleanlab/cleanlab). An avid educator, he also taught the first course on data-centric AI: https://dcai.csail.mit.edu/