Speaker: Paul Singman



Big data tools


A data lake is primarily two things: an object store and the objects being stored. Even with the most basic setup, data lakes can support BI, machine learning, and operational analytics use cases. This versatility speaks to the strength of object stores, particularly how easily they integrate with a diverse set of data processing engines. As data lake adoption exploded, a number of improvements were made to the first architectures. The first and most obvious improvement was to file formats, which led to the development of analytics-optimized formats like Parquet and, eventually, modern table formats. A newer improvement has been the emergence of data source control tools that bring new levels of manageability across an entire lake! In this talk, we'll cover how to incorporate these technologies into your data lake, and how they simplify workflows critical to ML experimentation, deployment of datasets, and more!
Who is this presentation for?
This presentation is for data engineers and data scientists looking to work more efficiently in data lake architectures.
Prerequisite knowledge:
A basic understanding of object stores and big data processing patterns is useful.
What you'll learn?
You'll learn what cutting-edge data lake architectures look like and what benefits they provide over first-generation data lakes.


Paul is a developer advocate for the lakeFS project, where he aims to better explain how to make data lakes more manageable. Prior to joining lakeFS, he spent several years working on data lakes on the analytics team at Equinox Fitness, where he architected and built real-time recommender systems and a serverless data platform. He has spoken at various conferences and meetups, including Data Council and AWS re:Invent, and is also a member of the AWS Community Builders program. When not working, you can find him running, playing golf, and drinking tea.