Industry News Details

Spark 3.0 Brings Big SQL Speed-Up, Better Python Hooks Posted on : Jul 03 - 2020

Apache Spark 3.0 is now here, and it’s bringing a host of enhancements across its diverse range of capabilities. The headliner is an big bump in performance for the SQL engine and better coverage of ANSI specs, while enhancements to the Python API will bring joy to data scientists everywhere.

In 10 short years, Spark has become the dominant data processing framework for parallel big data analytics. It started out as a replacement for MapReduce, but it’s still going strong even as excitement for Hadoop has faded. Today, it’s the Swiss Army knife of processing, providing capabilities spanning ETL and data engineering, machine learning, stream processing, and advanced SQL analytics.

Spark has evolved considerably since the early days. Few new applications today use the Resilient Distributed Dataset (RDD), which have largely been replaced by DataFrames. In concert with the shift to DataFrames, most applications today are using the Spark SQL engine, including many data science applications developed in Python and Scala languages.

The Spark SQL engine gains many new features with Spark 3.0 that, cumulatively, result in a 2x performance advantage on the TPC-DS benchmark compared to Spark 2.4. There are several reasons for this boost, including the new Adaptive Query Execution (AQE) framework that simplifies tuning by generating a better execution plan at runtime.

A blog post on the Databricks website explains the three main adaptive mechanisms in AQE. One of them is dynamically coalescing shuffle partitions, which “simplifies or even avoids tuning the number of shuffle partitions,” the Databricks authors write. Another technique, dynamically switching join strategies, can “partially avoid executing suboptimal plans due to missing statistics and/or size misestimation,” the company says. Finally, dynamically optimizing skew joins can help avoid extreme imbalances of work. View More