Speaker "Osman Sarood" Details

 

Topic

How to Cost Effectively and Reliably Build Infrastructure for Machine Learning

Abstract

Mist Systems consumes several terabytes of telemetry data every day from its wireless access points (APs) deployed all over the world. A significant portion of this telemetry is consumed by our machine learning algorithms, which are essential for the smooth operation of some of the world’s largest WiFi deployments. At Mist, we apply machine learning to incoming telemetry data to detect and attribute anomalies, a non-trivial problem that requires exploring multiple dimensions. Although our infrastructure is small compared to that of the tech giants, it is growing very rapidly. Last year, our infrastructure grew 10X, taking our annual AWS cost over $1 million. In this talk, we present how we kept our annual cost to $1 million rather than $3 million (i.e., a roughly 66% reduction in cost) by using AWS spot instances while keeping our infrastructure reliable.

Attendees will learn:
1. How to select the right EC2 instance types, i.e., compute-intensive versus memory-intensive
2. How much over-provisioning (extra capacity) is needed to ensure reliability
3. The impact of different types of applications, i.e., stateless and stateful, on 1 and 2 above
4. Key aspects of building real-time applications that run reliably on top of spot instances
5. How to monitor real-time applications in the presence of a high number of server faults due to spot instance terminations

The talk includes a demonstration of terminating random production hosts, showing:
- How we detect when a machine is terminated
- How applications running on terminated hosts can recover seamlessly
- A visualization of the impact on all the applications running on terminated hosts

Profile

Osman Sarood, Infrastructure and Operations Lead at Mist Systems, received his PhD in High Performance Computing from the Computer Science department at the University of Illinois Urbana-Champaign in December 2013, where he focused on load balancing and fault tolerance. Dr. Sarood has published more than 20 research papers in highly rated journals, conferences, and workshops. He has presented his research at several academic conferences and has over 400 citations, along with an i10-index and h-index of 12. He worked at Yelp from 2014 to 2016 as a Software Engineer, where he prototyped, architected, and implemented several key production systems that have been presented at various high-profile conferences. He presented his work, Seagull, at the Amazon Web Services (AWS) annual conference, re:Invent, in 2015. He architected and authored Yelp’s autoscaled spot infrastructure, fleet_miser, which was presented at AWS re:Invent 2016. Dr. Sarood joined Mist in 2016 and leads the infrastructure team, helping Mist scale the Mist Cloud in a cost-effective and reliable manner.