Speaker "Edward Schwalb" Details



Fast Scans for Continuous Variable Predicate Workloads


We consider workloads that reduce to evaluating predicates over continuous variables, for which a scan is required. We explore the trade-off of storing reusable predicate pre-computations instead of recomputing predicates for each entry of a full scan. Our approach generalizes column sketches to arbitrary n-ary predicates. Such a generalization is useful, for example, for identifying the video frames in which a vehicle is mis-positioned. Upon receipt of a new query, a compilation step constructs reusable components. Upon receipt of data, a build step produces a sketch representation of each newly received entry. Upon receipt of a compiled query, the scan is performed against the sketch data for >99% of the entries, in a fashion similar to cache operation; when misses occur, the full predicate evaluation is performed. We mitigate the exponential complexity of query compilation and the associated memory requirements using recursive domain decomposition. We provide a performance model that relies on metrics that can be measured on a small sample of the workload. Our experiments demonstrate that, in practice, for the 2-variable function we tested, using sketches with 7 bits at level 0 and 8 bits at level 1 achieves miss rates of approximately 0.0001. The empirical evaluation, using a C implementation that scans billions of entries with various sketch sizes, agrees with the performance model and clarifies when >10x performance gains are possible.
Keywords: Continuous Variables, Predicate Evaluation, Scan Accelerator, Reusable Component Storage and Precompute Tradeoff, Sketches.
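
To make the compile, build, and scan steps described in the abstract concrete, here is a minimal single-level sketch in Python. It is an illustration under stated assumptions, not the speaker's implementation: the names `code`, `compile_query`, and `scan`, the 4-bit sketch width, the [0, 1) value domain, and the example predicate x + y < 0.8 are all hypothetical. It also assumes the function is nondecreasing in both variables, so each grid cell can be classified from its corners alone, and it omits the recursive (multi-level) decomposition mentioned above.

```python
import random

BITS = 4                 # hypothetical sketch width per variable
CELLS = 1 << BITS        # number of quantization cells per variable
TRUE, FALSE, MISS = 1, 0, 2

def code(v):
    """Build step: quantize a value assumed to lie in [0, 1) to a BITS-bit code."""
    return int(v * CELLS)

def compile_query(f, threshold):
    """Compile step: classify every sketch cell for the predicate f(x, y) < threshold.

    Assumes f is nondecreasing in x and y, so the minimum and maximum of f
    over a cell are attained at its lower and upper corners.
    """
    width = 1.0 / CELLS
    table = [[MISS] * CELLS for _ in range(CELLS)]
    for i in range(CELLS):
        for j in range(CELLS):
            x0, y0 = i * width, j * width
            if f(x0 + width, y0 + width) < threshold:
                table[i][j] = TRUE       # predicate holds everywhere in the cell
            elif f(x0, y0) >= threshold:
                table[i][j] = FALSE      # predicate fails everywhere in the cell
    return table

def scan(entries, sketches, table, f, threshold):
    """Scan step: answer from the sketch; evaluate the full predicate only on a miss."""
    hits, misses = [], 0
    for idx, (cx, cy) in enumerate(sketches):
        verdict = table[cx][cy]
        if verdict == MISS:
            misses += 1
            x, y = entries[idx]
            verdict = TRUE if f(x, y) < threshold else FALSE
        if verdict == TRUE:
            hits.append(idx)
    return hits, misses

# Small demonstration on synthetic data (illustrative only).
random.seed(0)
f = lambda x, y: x + y
entries = [(random.random(), random.random()) for _ in range(10000)]
sketches = [(code(x), code(y)) for x, y in entries]   # built once, reused by every query
table = compile_query(f, 0.8)                         # compiled once per query
hits, misses = scan(entries, sketches, table, f, 0.8)
print(len(hits), "matches;", misses, "entries needed full evaluation")
```

Only the cells that straddle the predicate boundary are marked as misses, so widening the sketch shrinks the fraction of entries that fall back to full evaluation, at the cost of a larger compiled table.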

Who is this presentation for?
Architects and engineers of Big Data systems. Executives who need to improve the performance/cost trade-offs of Big Data analytics.

Prerequisite knowledge:
The talk is technical but accessible. Familiarity with Big Data systems internals.

What will you learn?
How to obtain >10x acceleration for full table scans.


Dr. Schwalb received his Ph.D. in Artificial Intelligence from the University of California, Irvine. He has more than 20 years of experience implementing intelligent systems for a wide range of industries, including defense, consumer electronics, finance, and engineering. His data engineering experience includes building sizable financial data warehouses and automated loan underwriting systems. He has authored a technical book, published in major journals, edited technical standards, and is credited with more than a dozen patents. At MSC Software, he was an architect of Apex, a product that won more than a dozen awards in 3 years. He currently leads the MSC machine learning effort, including simulation tools for training and validating driving agents. His research focuses on methods to engineer inherently safe drivers through quantification and validation of safety.