Back

Speaker "Dhruba Borthakur" Details Back

 

Topic

The Aggregator-Leaf-Tailer architecture for low-latency queries on large datasets

Abstract

Most traditional big data systems store data in a columnar format and rely on sequential scans for query processing. These systems are optimized for efficiency and not for query-latency, they typically process data that are minutes-to-hours stale. Some other search systems, like Elasticsearch, build inverted indices on the data, which makes simple queries faster. However, they’re not able to satisfy complex queries, and the price-to-performance is a concern for bigger datasets. Dhruba Borthakur describes the ATL Architecture that can be used to power low-latency queries on your most recent data. The ALT architecture has three distinguishing features: (1) combining columnar and search indices into a single system called Converged-Indexing (2) separating the read and write path by adopting the CQRS pattern (3) Document-Sharding to optimize latency. Dhruba starts by presenting a brief evolution of the ALT architecture, starting from how it was originally created to power Facebook's Multifeed system and Linkedin's FollowFeed system to the current day when ROCKSET uses it to power operational analytics. Converged-Indexing: Dhruba describes the details on how Rockset uses Converged-Indexing to shred a semi-structured document into a columnar index, an inverted-index and a document-centric-index and store them inside RocksDB-Cloud key-value storage engine. He provides examples of how a powerful query engine can automatically decide which index to use while processing a query. CQRS: Dhruba explains how the ALT architecture is an implementation of the well-known Command Query Request Segregation (CQRS) pattern applied to a data processing engine. The ALT architecture separates the write and read code-paths of a data system, thereby allowing both streaming and batch updates while keeping query latencies low. Document-sharding: The ALT architecture employs a document-sharded approach of distributing data and is different from contemporary data systems that typically employ term-sharding to distribute data. Every query is automatically parallelized to execute on all nodes, thereby providing the lowest of latencies. Additionally, the ALT architecture has a multi-level query processing unit that implements complex primitives like sorting, aggregation, joins and filtering. ALT is an alternative to the popular Lambda architecture, especially for workloads that require very low latency queries on your most recent data.
Who is this presentation for?
Data engineers
Prerequisite knowledge:

What you'll learn?

Profile

Rockset, RocksDB-Cloud, HDFS