Cloudera Picks Iceberg, Touts 10x Boost in Impala

Posted on: Jul 02, 2022

Cloudera is now supporting the open source Apache Iceberg table format in its cloud data platform, or lakehouse, the vendor announced yesterday. The move will help to ensure transactional integrity in the big data environments of Cloudera customers, while giving Impala queries a 10x performance boost. It will also give the Iceberg project more momentum to become the center of the open data ecosystem.

Apache Iceberg emerged several years ago to address data engineering issues afflicting users of the Apache Hive metastore. The metastore continued to be used to manage data access and control in complex HDFS and S3 environments, even as use of Hive’s SQL engine waned with the arrival of faster query engines.

Data engineers at Netflix and Apple were frustrated by several problems with the Hive metastore, starting with its lack of transactional integrity. That gap could wreak havoc in busy big data environments, where multiple teams accessed data with a variety of engines and services, including Presto, Dremio, Trino, Apache Spark, and Apache Flink, among others.

Without support for atomic transactions, customers could get the wrong answers when querying their Parquet tables, unless extreme pains were taken to ensure data consistency. “Quite simply, tables shouldn’t lie to you when you query them,” said Iceberg creator and PMC Chair Ryan Blue, formerly of Netflix.

Iceberg addressed other issues with Hive too, including providing finer-grained file operations for data stored in object stores and support for in-place table evolution. The table format has been adopted by several big cloud vendors, including AWS and Snowflake, both of which announced support for Iceberg earlier this year.