
Speaker "Alex Sergeev" Details Back

 

Topic

Distributed Deep Learning with Horovod

Abstract

Learn how to scale distributed training of TensorFlow, PyTorch, and Apache MXNet models with Horovod, a library designed to make distributed training fast and easy to use. Although these frameworks simplify the design and training of deep learning models, difficulties usually arise when scaling a model to multiple GPUs in a server or to multiple servers in a cluster. We'll explain how Horovod takes a model designed for a single GPU and trains it on a cluster of GPU servers.
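
As a taste of the workflow the talk covers, here is a minimal sketch using Horovod's PyTorch API; the toy model, data, and hyperparameters are placeholders, not material from the session. It shows the handful of changes that turn a single-GPU training script into a distributed one:

# A minimal sketch of converting a single-GPU PyTorch script to Horovod.
# The toy model, data, and hyperparameters below are placeholders.
import torch
import horovod.torch as hvd

hvd.init()                                   # one process per GPU
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank()) # pin each process to its GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(10, 1).to(device)
# Scale the learning rate by the number of workers, a common convention.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start all workers from identical model and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = torch.nn.MSELoss()
for step in range(100):
    x = torch.randn(32, 10, device=device)  # stand-in for a real data shard
    y = torch.randn(32, 1, device=device)
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()                         # gradient averaging happens here

The same script runs on one GPU or many: for example, horovodrun -np 4 python train.py launches four processes that coordinate gradient exchange through Horovod's ring-allreduce.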

Who is this presentation for?
Deep learning engineers, ML infrastructure engineers, technical decision makers

Prerequisite knowledge
Familiarity with TensorFlow, PyTorch, or Apache MXNet

What you'll learn
Approaches for scaling deep learning training, and how to improve the separation of concerns between deep learning engineers and ML infrastructure teams

Profile

Alex Sergeev is a staff engineer at Uber working on scalable deep learning. Previously, he was a senior engineer at Microsoft working on big data mining. He received his master's degree in computer science from National Research Nuclear University's Moscow Engineering Physics Institute.