How Cerner Uses CDH with Apache Kafka

Posted on: Dec 08, 2015

Over the years, Cerner Corp., a leading healthcare IT provider, has utilized several of the core technologies available in CDH, Cloudera's software platform containing Apache Hadoop and related projects, including HDFS, Apache HBase, Apache Crunch, Apache Hive, and Apache Oozie. Building upon those technologies, we have been able to architect solutions that handle our diverse ingestion and processing requirements.

At various points, however, we reached certain scalability limits and perhaps even abused the intent of certain technologies, causing us to look for better options. By adopting Apache Kafka, Cerner has been able to solidify our core infrastructure, utilizing those technologies as they were intended.

One of the early challenges Cerner faced when building our initial processing infrastructure was moving from batch-oriented processing to technologies that could handle a streaming, near-real-time system. Building upon the concepts in Google's Percolator paper, we built a similar infrastructure on top of HBase. Listeners would register interest in data of specific types, from specific sources, written to a given table. For each write performed, a notification for each applicable listener would be written to a corresponding notification table. Listeners would continuously scan a small set of rows in the notification table looking for new data to process, deleting each notification once processing was complete.
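For illustration, that scan-and-delete loop might look roughly like the following sketch, written against the modern HBase Java client (the client APIs available in 2015 differed slightly). The table name, row-key prefix, and processing method are hypothetical, invented for this example rather than taken from the post.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class NotificationListener {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table notifications = conn.getTable(TableName.valueOf("notifications"))) {
          // Each listener repeatedly scans only its own slice of the table,
          // here keyed by a hypothetical "listenerA|" row-key prefix.
          Scan scan = new Scan()
              .withStartRow(Bytes.toBytes("listenerA|"))
              .withStopRow(Bytes.toBytes("listenerA|~"));
          try (ResultScanner results = notifications.getScanner(scan)) {
            for (Result row : results) {
              process(row); // fetch the payload and supporting data, then transform
              // Delete the notification so the next scan does not reprocess it;
              // these deletes are what later forced frequent compactions.
              notifications.delete(new Delete(row.getRow()));
            }
          }
        }
      }

      private static void process(Result notification) { /* payload handling elided */ }
    }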

Our low-latency processing infrastructure worked well for a time but quickly reached scalability limits rooted in its use of HBase. Listener scan performance would degrade unless deleted notifications were removed through frequent compactions, yet those same compactions would drag down performance, causing severe drops in processing throughput. Processing also required frequent reads from HBase to retrieve the notification, the payload, and often supporting information from other HBase tables. That high read volume would contend with the writes from our processing infrastructure, which was persisting transformed payloads and additional notifications for downstream listeners. The I/O contention and the compaction needs required careful management to distribute the load across the cluster, often segregating the notification tables onto isolated RegionServers.

Adopting Kafka was a natural fit for reading and writing notifications. Instead of scanning rows in HBase, a listener would process messages from a Kafka topic, updating its offset as notifications were successfully processed.
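A minimal sketch of such a listener, using the current Kafka Java consumer (the 2015-era high-level consumer API was different); the broker address, group id, and topic name are placeholders:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class KafkaNotificationListener {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");   // placeholder broker
        props.put("group.id", "listener-a");             // one consumer group per listener
        props.put("enable.auto.commit", "false");        // commit only after processing
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
          consumer.subscribe(Collections.singletonList("notifications")); // placeholder topic
          while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
              process(record.value()); // same processing as before, with no row deletes
            }
            consumer.commitSync(); // advancing the offset replaces deleting HBase rows
          }
        }
      }

      private static void process(String notification) { /* elided */ }
    }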

Kafka's natural separation of producers and consumers eliminated the contention that the high volume of notification reads and writes had caused at the HBase RegionServers. Kafka's consumer offset tracking eliminated the need for notification deletes, and replaying notifications became as simple as resetting an offset in Kafka. Offloading this highly transient data from HBase greatly reduced the unnecessary overhead of compactions and heavy I/O.
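To illustrate how simple replay becomes, a listener can rewind its consumer to an earlier offset rather than re-inserting notification rows. A sketch, again using the current Java consumer with placeholder names:

    import java.time.Duration;
    import java.util.Collections;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class NotificationReplay {
      // Rewind a listener's consumer group to the earliest retained offset,
      // so every retained notification is delivered and processed again.
      static void replayFromBeginning(KafkaConsumer<String, String> consumer, String topic) {
        consumer.subscribe(Collections.singletonList(topic));
        while (consumer.assignment().isEmpty()) {
          // Poll until partitions are assigned; any records returned here
          // are ignored for brevity in this sketch.
          consumer.poll(Duration.ofMillis(100));
        }
        consumer.seekToBeginning(consumer.assignment());
        // Subsequent poll() calls return messages from the start of each partition.
      }
    }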

 

Building upon the success of Kafka-based notifications, Cerner then explored using Kafka to simplify and streamline data ingestion. Cerner systems ingest data from multiple disparate sources and systems, many of them external to our data centers. The "Collector," a secured HTTP endpoint, identifies and namespaces the data before it is persisted into HBase. Prior to utilizing Kafka, our data ingestion infrastructure targeted a single data store, such as an HBase cluster.
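The post implies that, with Kafka in the path, a Collector-style endpoint no longer needs to target a single store: it can publish identified, namespaced payloads to a topic that any number of downstream stores consume. A hedged sketch of what the Kafka-facing side of such an endpoint might look like; the topic naming scheme, class, and method names are invented for illustration and are not Cerner's actual design:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class Collector {
      private final KafkaProducer<String, byte[]> producer;

      Collector() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // placeholder broker
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.ByteArraySerializer");
        producer = new KafkaProducer<>(props);
      }

      // Called by the HTTP layer once a request has been authenticated.
      // Identifies and namespaces the payload by source and type, then
      // publishes it for downstream stores to consume independently.
      void handleUpload(String sourceSystem, String dataType, byte[] payload) {
        String topic = "ingest." + sourceSystem + "." + dataType; // hypothetical scheme
        producer.send(new ProducerRecord<>(topic, sourceSystem, payload));
      }
    }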