
Speaker Details: Jim Dowling


Topic

Scaling TensorFlow to 100s of GPUs with Spark and Hops Hadoop

Abstract

In June 2017, Facebook shook the deep learning community by showing how they could reduce training time for deep neural networks on the ImageNet dataset from 2 weeks to 1 hour. For deep learning practitioners outside hyperscale AI companies, the next great frontier is distribution. Distributed training and parallel experiments (hyperparameter optimization) offer the potential for both more productive data scientists and reduced time-to-market for new models. In this talk, we will navigate through the jungle of distributed TensorFlow frameworks, many of which leverage Apache Spark for managing distribution. We will describe the two dominant architectures, the parameter server model and the Ring/Allreduce model, along with popular frameworks based on those models: TensorFlowOnSpark and Horovod, by Yahoo and Uber, respectively. We will also investigate integration with resource managers, and how GPUs-as-a-Resource are supported in the Hops Hadoop platform, enabling both enterprise and commodity GPUs to be securely shared among teams. Finally, we will perform a live demonstration of training a distributed TensorFlow application, written in a Jupyter notebook, that reads data from HDFS and transforms the data with Spark. We will show how to debug the application using both the Spark UI and TensorBoard, and how to examine logs and monitor training. All code and datasets presented will be 100% open source.
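The Ring/Allreduce schedule at the heart of Horovod can be illustrated with a toy sequential simulation. This is a sketch for intuition only: real implementations exchange chunks concurrently over NCCL or MPI between GPUs, whereas here the "send" steps are replayed in a single Python process. The function name and chunk layout are illustrative, not taken from any framework.

```python
def ring_allreduce(grads):
    """Sum per-worker gradient vectors with the ring-allreduce schedule.

    grads: one equal-length vector per worker (length divisible by the
    number of workers). Returns one list per worker; every worker ends
    up holding the identical element-wise sum, which is why the
    algorithm needs no central parameter server.
    """
    n = len(grads)
    chunk = len(grads[0]) // n
    # View each worker's vector as n chunks (mutable copies).
    buf = [[list(g[i * chunk:(i + 1) * chunk]) for i in range(n)]
           for g in grads]

    # Phase 1: reduce-scatter. In each of n-1 steps, every worker passes
    # one chunk to its ring neighbour, which adds it to its own copy.
    # Afterwards, worker w owns the fully summed chunk (w + 1) % n.
    for t in range(n - 1):
        # Snapshot outgoing chunks first, so all sends in a step
        # happen "simultaneously" (no double counting).
        sends = [(w, (w - t) % n, list(buf[w][(w - t) % n]))
                 for w in range(n)]
        for w, c, data in sends:
            dst = (w + 1) % n
            for j in range(chunk):
                buf[dst][c][j] += data[j]

    # Phase 2: allgather. The fully reduced chunks circulate around the
    # ring for n-1 more steps, overwriting stale copies, until every
    # worker holds the complete summed vector.
    for t in range(n - 1):
        sends = [(w, (w + 1 - t) % n, list(buf[w][(w + 1 - t) % n]))
                 for w in range(n)]
        for w, c, data in sends:
            buf[(w + 1) % n][c] = data

    return [[x for ch in b for x in ch] for b in buf]
```

Each worker sends and receives only 2 * (n-1)/n of the vector in total, independent of the number of workers, which is the bandwidth-optimality property that makes this model scale better than a parameter server under heavy gradient traffic.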

Profile

Jim Dowling is an Associate Professor at the ICT School at KTH Royal Institute of Technology, a Senior Researcher at SICS RISE, and CEO of Logical Clocks AB. He received his Ph.D. in Distributed Systems from Trinity College Dublin, and has worked at MySQL AB. His research interests are in large-scale distributed systems and machine learning. He is lead architect of Hops Hadoop (www.hops.io), the world's most scalable Hadoop distribution and the only Hadoop distribution that supports GPUs-as-a-Resource. He teaches the first and largest course in Sweden on Deep Learning, ID2223, and is a regular speaker at AI / Big Data industry conferences.