Speaker "Chunky Gupta" Details

 

Topic

Live Aggregators: A reliable, scalable and cost-effective way of aggregating billions of messages a day in real time

Abstract

We discuss Mist’s real-time data pipeline, focusing on Live Aggregators (LA), a highly reliable, fault-tolerant, and scalable in-house real-time aggregation system that autoscales in response to sudden changes in load. LA consumes billions of messages a day from Kafka, holds a memory footprint of over 4 TB, and aggregates over 600 million time series. It runs entirely on AWS Spot Instances, so it is designed to stay reliable even when instances are lost. LA writes the aggregated data to the configured system (Cassandra, S3, SignalFx, or Kafka). It performs over 9 billion writes to Cassandra per day and maintains over 600 million concurrent state machines. LA checkpoints its state to S3 so that, after a failure, it can restore that state and resume from the Kafka message where it left off. This lets LA recover from an hours-long EC2 outage without data loss.
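The checkpoint-and-resume idea in the abstract can be illustrated with a minimal sketch. This is not LA's implementation: a plain dict stands in for S3, a Python list stands in for one Kafka partition, and the names `Aggregator`, `checkpoint`, and `restore` are hypothetical. The point it shows is that the aggregated state and the next offset to read are saved together, so a restarted worker replays only the messages that arrived after the last checkpoint.

```python
import json


class Aggregator:
    """Sums values per key and remembers the next log offset to read."""

    def __init__(self):
        self.sums = {}
        self.next_offset = 0

    def consume(self, log, upto):
        # Process messages from next_offset up to (not including) `upto`.
        for offset in range(self.next_offset, upto):
            key, value = log[offset]
            self.sums[key] = self.sums.get(key, 0) + value
            self.next_offset = offset + 1

    def checkpoint(self, store, name):
        # State and offset are written as one object so they cannot disagree.
        store[name] = json.dumps(
            {"sums": self.sums, "next_offset": self.next_offset}
        )

    @classmethod
    def restore(cls, store, name):
        # Start empty if no checkpoint exists, otherwise load the snapshot.
        agg = cls()
        if name in store:
            snapshot = json.loads(store[name])
            agg.sums = snapshot["sums"]
            agg.next_offset = snapshot["next_offset"]
        return agg


# Stand-ins (assumptions): `log` for a Kafka partition, `store` for S3.
log = [("ap1", 2), ("ap2", 5), ("ap1", 3), ("ap2", 1), ("ap1", 4)]
store = {}

worker = Aggregator()
worker.consume(log, upto=3)
worker.checkpoint(store, "partition-0")
worker.consume(log, upto=4)  # progress after the checkpoint, lost in a crash

# A replacement worker restores the snapshot and replays from offset 3.
recovered = Aggregator.restore(store, "partition-0")
recovered.consume(log, upto=len(log))
```

Because the replay starts at the checkpointed offset, the recovered totals match what an uninterrupted worker would have produced, provided the log retains messages for at least as long as the outage lasts.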

Who is this presentation for?
Data Scientists, Infrastructure Engineers, Distributed Systems Engineers, Site Reliability Engineers, and Directors of Engineering

Prerequisite knowledge:

What you'll learn?
- Understand considerations for designing real-time applications that can autoscale for seasonal changes in load and achieve service-wide CPU utilization of over 75%
- Learn how Mist reliably maintains over 4 TB of application state amid frequent server faults by checkpointing to AWS S3, and how it uses multilevel aggregation to solve the problem of aggregating across sharded data
- Discover how Mist identified the key metrics that serve as inputs to its autoscaling engine
- Hear lessons learned from building a highly scalable and reliable real-time aggregation system
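As a rough illustration of multilevel aggregation across sharded data, the sketch below uses one common two-level scheme; the source does not describe LA's actual design, and the function names and sample keys are hypothetical. Each shard first reduces its own messages to a mergeable partial (sum and count), and a second level merges the partials. Merging sums and counts, rather than averaging per-shard averages, keeps the result correct when shards see different numbers of messages for the same key.

```python
from collections import defaultdict


def partial_aggregate(messages):
    """Level 1: one shard reduces its own messages to (sum, count) per key."""
    partials = defaultdict(lambda: [0.0, 0])
    for key, value in messages:
        partials[key][0] += value
        partials[key][1] += 1
    return {key: tuple(pair) for key, pair in partials.items()}


def merge_partials(shard_outputs):
    """Level 2: combine per-shard partials into one mean per key."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in shard_outputs:
        for key, (total, count) in partial.items():
            totals[key][0] += total
            totals[key][1] += count
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical data: the same keys appear on two different shards.
shard_a = [("site1", 10.0), ("site1", 20.0), ("site2", 4.0)]
shard_b = [("site1", 60.0), ("site2", 8.0)]

means = merge_partials(
    [partial_aggregate(shard_a), partial_aggregate(shard_b)]
)
```

Here the mean for `site1` is computed from all three of its values, even though two of them were seen by one shard and one by the other.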

Profile

Chunky is a Distributed Systems Engineer at Mist Systems, where he works on scaling the company’s cloud infrastructure. He received his M.S. in Computer Science from Texas A&M University in 2014. He worked at Yelp for two years as a Software Engineer, where he developed FleetMiser, an autoscaling engine that intelligently autoscales Yelp’s Mesos cluster and saved millions of dollars. He presented FleetMiser at AWS re:Invent 2016. He also scaled Seagull, Yelp’s in-house distributed and reliable task runner, and wrote a post about it on the Yelp Engineering Blog. Before that, he built a Hadoop-based data warehouse system at Vizury.