Spark And Kafka Are Powering Today's Modern Data Apps, So What Will Happen To Hadoop?

Posted on: Aug 12, 2019

I spoke briefly about the evolution of modern data apps in a previous Forbes piece on the use of big data in the cloud. Essentially, the history of big data can be divided into three broad mega-phases. Phase I was experimental, with organizations tinkering with MapReduce, Pig and other native Hadoop services (choices were limited) to explore the technology’s basic capabilities and see what sort of value it could offer. This tinkering was done mostly by Google, Yahoo and a handful of other major web companies.

Phase II saw the separation of storage and processing, as well as the use of the cloud to deploy big data (perhaps most notably in the form of Amazon EMR and Microsoft HDInsight). During this era, organizations began using Hadoop, Spark and S3 together to drive real value from their big data in the form of applications such as recommendation engines and fraud detection.

Today, we’re in Phase III, which is characterized by a rapidly expanding ecosystem and the adoption of advanced big data services to derive even greater value and serve more fine-tuned use cases. Spark and Kafka are two of the newer technologies that are defining this phase and reshaping the big data stack. What’s driving their success, and how will the rise of Spark and Kafka impact older data technologies?

The Growth Of Spark And Kafka

Spark and Kafka have seen rising adoption over the past few years as a result of booming interest in streaming applications, data science and artificial intelligence/machine learning. Both are key big data technologies that support applications in all three areas. Spark is an extremely fast, open-source processing and analytics engine that’s ideal for large quantities of real-time data. Kafka is also an open-source stream-processing platform, but it’s used mostly for transporting data between systems, applications, data producers and consumers. Both are efficient, quick, low-latency technologies geared toward leveraging streaming/real-time data.

Streaming apps produce and/or rely on a constant flow of streaming data (common examples include recommendation engines and internet of things (IoT) apps). Data science typically uses streaming data (rather than batch data) to provide rapid insights. Similarly, artificial intelligence/machine learning models leverage streaming data to constantly train and learn. In short, all three make heavy use of streaming data, and Spark and Kafka are involved in processing, analyzing and transporting that data. As streaming applications, data science and machine learning have taken off, so have Spark and Kafka.
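
To make that division of labor concrete, here is a minimal sketch of how the two are commonly paired: a Spark Structured Streaming job that consumes records from a Kafka topic and processes them as they arrive. The broker address, the topic name ("clickstream") and the console sink are illustrative placeholders, not details from this article, and running the job requires Spark's Kafka connector package on the classpath.

```python
from pyspark.sql import SparkSession

# Start a Spark session (assumes the spark-sql-kafka connector is available).
spark = (
    SparkSession.builder
    .appName("kafka-stream-sketch")
    .getOrCreate()
)

# Kafka transports the events; Spark subscribes and reads them as a stream.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "clickstream")                    # placeholder topic
    .load()
)

# Kafka delivers keys and values as bytes; cast the value to a string
# so Spark can process it with ordinary DataFrame operations.
messages = events.selectExpr("CAST(value AS STRING) AS message")

# Write the processed stream to the console, purely for illustration;
# a real pipeline would feed a recommendation engine, fraud model, etc.
query = (
    messages.writeStream
    .outputMode("append")
    .format("console")
    .start()
)

query.awaitTermination()
```

In this pattern Kafka handles durable, low-latency transport between producers and consumers, while Spark does the continuous processing and analytics on whatever arrives, which is exactly the split described above.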