Speaker: Ram Sriharsha



Scaling Genomics on the Cloud


Next-generation sequencing is becoming cheaper and more accessible, and the volume of sequenced data is growing faster than Moore’s Law. However, it is still expensive and slow to go from raw reads to variant calls, and to produce annotated variants that can then be analyzed downstream. In this talk, we will discuss the first state-of-the-art, scalable, and simple DNA sequencing workflow built on top of Apache Spark and the Databricks APIs. The pipeline is simple to set up, easy to scale out, and can sequence a 30x coverage genome cost-efficiently on the cloud. We’ll introduce the problem of alignment and variant calling on whole genomes, discuss the challenges of building a simple yet scalable pipeline, and demonstrate our solution. This talk should be of interest to developers who want to build ETL pipelines on top of Apache Spark, as well as biochemists and molecular biologists who want to learn how to develop cheap and fast DNA sequencing pipelines.


I am a Product Manager at Databricks, where I am the engineering and product lead for the unified analytics platform for genomics. Prior to Databricks, I worked on scalable machine learning at Yahoo as a Principal Research Scientist. My interests are in genomics, machine learning, online learning, and big data analytics.