Speaker "Ram Sriharsha" Details
Name
Ram Sriharsha
Company
Databricks
Designation
Product Management
Topic
Scaling Genomics on the Cloud
Abstract
Next-generation sequencing is becoming cheaper and more accessible, and the volume of sequenced data is growing faster than Moore’s Law. However, it is still expensive and slow to go from raw reads to variant calls, and to produce annotated variants that can then be analyzed downstream. In this talk, we will discuss the first state-of-the-art, scalable, and simple DNA sequencing workflow built on top of Apache Spark and the Databricks APIs. The pipeline is simple to set up, easy to scale out, and can sequence a 30x coverage genome cost-efficiently on the cloud. We’ll introduce the problem of alignment and variant calling on whole genomes, discuss the challenges of building a simple yet scalable pipeline, and demonstrate our solution. This talk should be of interest to developers who wish to build ETL pipelines on top of Apache Spark, as well as biochemists and molecular biologists who wish to learn how to develop cheap and fast DNA sequencing pipelines.