Speaker "Jim Dowling" Details
Name
Jim Dowling
Company
KTH - Royal Institute Of Technology
Designation
Professor
Topic
Scaling Tensorflow to 100s of GPUs with Spark and Hops Hadoop
Abstract
In June 2017, Facebook shook the deep learning community by showing how they could reduce training time for deep neural networks on the ImageNet dataset from 2 weeks to 1 hour. For deep learning practitioners outside hyperscale AI companies, the next great frontier is distribution. Distributed training and parallel experiments (hyperparameter optimization) offer the potential for both more productive data scientists and reduced time-to-market for new models. In this talk, we will navigate the jungle of distributed TensorFlow frameworks, many of which leverage Apache Spark for managing distribution. We will describe the two dominant architectures, the parameter server model and the Ring-Allreduce model, and popular frameworks based on those models: TensorFlowOnSpark and Horovod, by Yahoo and Uber, respectively. We will also investigate integration with resource managers, and how GPUs-as-a-Resource are supported in the Hops Hadoop platform, enabling both enterprise and commodity GPUs to be securely shared among teams. Finally, we will give a live demonstration of training a distributed TensorFlow application, written in Jupyter, that reads data from HDFS and transforms the data in Spark. We will show how to debug the application using both the Spark UI and TensorBoard, and how to examine logs and monitor training. All code and datasets presented will be 100% open source.
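The Ring-Allreduce pattern underlying Horovod can be illustrated with a small single-process simulation. This is an illustrative sketch only (the worker list, `ring_allreduce` function, and sequential message passing are assumptions for the demo, not Horovod's actual implementation, which uses NCCL/MPI between real processes):

```python
def ring_allreduce(grads):
    """Simulate Ring-Allreduce: sum the workers' gradient vectors so
    that every worker ends up with the full sum.

    `grads` is a list of equal-length lists, one per simulated worker.
    Each vector is split into n chunks; n-1 scatter-reduce steps sum
    the chunks around the ring, then n-1 allgather steps circulate
    the fully reduced chunks back to every worker.
    """
    n = len(grads)
    length = len(grads[0])
    assert length % n == 0, "sketch assumes vector length divisible by n"
    chunk = length // n
    bufs = [list(g) for g in grads]  # each worker's local buffer

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: scatter-reduce. In step t, worker r sends chunk
    # (r - t) mod n to its right neighbour, which adds it in.
    # Afterwards, worker r holds the full sum of chunk (r + 1) mod n.
    for step in range(n - 1):
        msgs = [((r + 1) % n, (r - step) % n,
                 [bufs[r][i] for i in span((r - step) % n)])
                for r in range(n)]
        for dst, c, data in msgs:
            for k, i in enumerate(span(c)):
                bufs[dst][i] += data[k]

    # Phase 2: allgather. Each worker forwards its completed chunk
    # around the ring, overwriting (not adding) at the receiver.
    for step in range(n - 1):
        msgs = [((r + 1) % n, (r + 1 - step) % n,
                 [bufs[r][i] for i in span((r + 1 - step) % n)])
                for r in range(n)]
        for dst, c, data in msgs:
            for k, i in enumerate(span(c)):
                bufs[dst][i] = data[k]

    return bufs


if __name__ == "__main__":
    # Two workers, each with a 2-element gradient; both end with the sum.
    print(ring_allreduce([[1.0, 2.0], [3.0, 4.0]]))  # → [[4.0, 6.0], [4.0, 6.0]]
```

Each worker sends and receives only 2·(n-1)/n of the data per allreduce regardless of the number of workers, which is why this topology scales better than a central parameter server as GPU counts grow.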