Speaker "Tarunam Mahajan" Details

Topic

Beyond Real Data: Scaling Machine Learning with NVIDIA-Powered Synthetic Data Generation

Abstract

The demand for high-quality, diverse, and perfectly labeled data is the single greatest bottleneck in modern AI development. Real-world data collection is costly, time-consuming, privacy-constrained, and often fails to capture the rare but critical "long-tail" edge cases. This session tackles that challenge head-on. We will explore the paradigm shift to synthetic data generation: the creation of photorealistic, physically accurate data from simulations and generative AI. As a TPM at NVIDIA, I will provide a practical overview of how we leverage our full-stack platform, including NVIDIA Omniverse, Isaac Sim, and Omniverse Replicator, to create massive, annotated datasets for training robust perception models. We'll also explore the new frontier of using foundation models like NVIDIA Cosmos to create and augment virtual worlds, effectively bridging the "sim-to-real" gap. Attendees will leave with a clear, actionable framework for building a scalable data generation pipeline, moving their organizations from a state of data scarcity to one of data abundance.
Who is this presentation for?
This presentation is designed for AI/ML practitioners and leaders who are seeking to overcome data bottlenecks and accelerate their model development lifecycles.
Primary audience: Machine Learning Engineers, Data Scientists, Computer Vision Engineers, and Robotics Engineers who build and train models.
Secondary audience: AI Product Managers, Technical Program Managers, and Engineering Leads who are responsible for AI strategy, data acquisition, MLOps, and model deployment.
Prerequisite knowledge:
Attendees should have a foundational understanding of machine learning concepts and the model training lifecycle (i.e., what training data is and why it's used). Familiarity with the challenges of data collection and annotation (e.g., in computer vision) is highly beneficial. While no prior experience in 3D simulation or generative AI is required, a basic knowledge of Python and common ML frameworks (like PyTorch or TensorFlow) will be helpful for fully understanding the code examples and pipeline architecture.
What you'll learn
This session will provide attendees with practical, high-level takeaways. They will learn to:
1. Identify the key bottlenecks in their own data pipelines where synthetic data can provide the maximum ROI.
2. Differentiate between the two primary SDG methods: physics-based simulation (for "ground truth" data) and generative AI (for augmenting reality).
3. Understand the architecture of a modern, scalable SDG pipeline using tools like NVIDIA Omniverse Replicator for generating labeled, domain-randomized data.
4. Appreciate how this approach solves critical "long-tail" or edge-case problems (e.g., rare product defects, dangerous driving scenarios) that real data often misses.
5. Gain actionable insights on bridging the "sim-to-real" gap to deploy models trained on synthetic data into the real world with confidence.
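To make the pipeline idea concrete: a synthetic data generation loop, at its core, repeats three steps per frame: randomize scene parameters, render, and emit the image together with its free, perfectly accurate labels. The sketch below is a framework-agnostic Python illustration of that loop, not the session's actual code; the renderer is stubbed out and all names and parameter ranges are hypothetical (in a real pipeline, a tool like NVIDIA Omniverse Replicator supplies the randomization, rendering, and annotation primitives).

```python
import random
from dataclasses import dataclass


@dataclass
class SceneParams:
    """One domain-randomized scene configuration (hypothetical parameters)."""
    light_intensity: float   # arbitrary lighting units
    object_yaw_deg: float    # object rotation about the vertical axis
    camera_distance_m: float
    texture_id: int


def sample_scene(rng: random.Random) -> SceneParams:
    """Domain randomization: draw each parameter from a broad distribution
    so a model trained on the output cannot overfit to any single look."""
    return SceneParams(
        light_intensity=rng.uniform(100.0, 2000.0),
        object_yaw_deg=rng.uniform(0.0, 360.0),
        camera_distance_m=rng.uniform(0.5, 3.0),
        texture_id=rng.randrange(50),
    )


def render(params: SceneParams) -> bytes:
    """Stub for the renderer. A real pipeline would invoke a simulator or
    path tracer here and return actual pixels."""
    return repr(params).encode()


def generate_dataset(num_frames: int, seed: int = 0):
    """The core SDG loop: randomize -> render -> label, num_frames times."""
    rng = random.Random(seed)
    dataset = []
    for frame_id in range(num_frames):
        params = sample_scene(rng)
        image = render(params)
        # Labels are "free" in simulation: exact poses, classes, and
        # occlusions are known, so no human annotation is needed.
        label = {"frame": frame_id, "yaw_deg": params.object_yaw_deg}
        dataset.append((image, label))
    return dataset
```

Seeding the generator makes every dataset reproducible, which is what lets a pipeline like this scale out across many workers and still be audited frame by frame.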

Profile

Senior Technical Program Manager, NVIDIA Enterprise AI