Back

Speaker "Aaron Williams" Details Back

 

Topic

Speed Meets Scale for Predictive Analytics: Running Billions of Data Points Through a Full Machine Learning Pipeline

Abstract

Use of the humble GPU has spiked over the past couple years as machine learning and data analytics workloads have been optimized to take advantage of the GPU’s parallelism and memory bandwidth. Even though these operations (the steps of the Machine Learning Pipeline) could all be run on the same GPUs, they were typically isolated, and much slower than they needed to be, because data was serialized and deserialized between the steps over PCIe. That inefficiency was recently addressed by the formation of the GPU Open Analytics Initiative (GOAI http://gpuopenanalytics.com/), an industry standard founded by MapD, H2O.ai and Anaconda. This group created the GPU data frame (GDF), based on Apache Arrow, for passing data between processes and keeping it all in the GPU. In this talk we will explain how the GDF technology works, show how it is enabling a diverse set of GPU workloads, and demonstrate how to use Python in a Jupyter Notebook to take advantage of it. We’ll demonstrate on a very large dataset how to manage a full Machine Learning Pipeline with minimal data exchange overhead between MapD’s SQL engine and H2O’s generalized linear model library (GLM).

Profile

Aaron is responsible for MapD’s developer, user and open source communities. He comes to MapD with more than two decades of previous success building ecosystems around some of software’s most familiar platforms. Most recently he ran the global community for Mesosphere, including leading the launch and growth of DC/OS as an open source project. Prior to that he led the Java Community Process at Sun Microsystems, and ecosystem programs at SAP. Aaron has also served as the founding CEO of two startups in the entertainment space. Aaron has an MS in Computer Science and BS in Computer Engineering from Case Western Reserve University.