Stream Data Processing Using Spark and Kafka

If you have been in the IT industry for some time, you have probably come across dashboards in software applications that display metrics such as system health, stock market rates, or traffic information. Traditionally, the refresh intervals for such dashboards were measured in hours, because of the way legacy systems were built. In such systems, an ETL process would run, say, every hour and feed the collected data to a batch processing application, which would process it and send the result to the dashboard. The whole cycle, from data extraction to displaying the processed data, takes a long time to complete. The software industry has evolved a great deal in the past 6 to 8 years, and we now have platforms like Apache Spark and tools like Apache Kafka that can cut the refresh interval of dashboards from hours to minutes or even seconds.

So let me start with what Apache Kafka and Apache Spark are.

Apache Kafka: It's a high-throughput distributed messaging system. Its strengths are as follows:

*High-Throughput & Low Latency: Even with very modest hardware, Kafka can support hundreds of thousands of messages per second, with latencies as low as a few milliseconds.

*Scalability: A Kafka cluster can be elastically and transparently expanded without downtime.

*Durability & Reliability: Messages are persisted on disk and replicated within the cluster to prevent data loss.

*Fault-Tolerance: The Kafka cluster tolerates the failure of individual machines.

*High Concurrency: Ability to handle a large number (thousands) of diverse clients simultaneously writing to and reading from Kafka.
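To make the scalability and concurrency points a little more concrete, here is a minimal sketch in plain Python (no Kafka client involved) of how keyed messages might be spread across the partitions of a topic. Everything in it is an assumption made for illustration: the function names `partition_for` and `route` are hypothetical, the partition count of 3 is just an example, and `zlib.crc32` is only a stand-in for whatever hash a real Kafka client uses.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumption: an example topic with 3 partitions


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition index using a stable hash."""
    # zlib.crc32 is a stand-in hash, not what a Kafka client actually uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages, num_partitions: int = NUM_PARTITIONS):
    """Group (key, value) messages by partition, preserving arrival order."""
    partitions = defaultdict(list)
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


# Hypothetical health messages keyed by host name.
sample_messages = [
    ("host-01", "cpu=10"),
    ("host-02", "cpu=20"),
    ("host-01", "cpu=30"),
]
routed = route(sample_messages)
```

Because the hash is stable, all messages with the same key land in the same partition and keep their order there, while different partitions can be written and read by different clients in parallel.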

Apache Spark: It's a distributed processing platform. Some like to call it “Lightning Fast Cluster Computing”. Its strengths are as follows:

*Spark keeps working data in memory while executing a job, so it can achieve faster batch processing than MapReduce. Depending on the workload, Spark can execute batch-processing jobs 10 to 100 times faster than MapReduce.

*Spark is ideal for iterative processing, interactive processing and event stream processing.

*Spark can run on Hadoop alongside other tools in the Hadoop ecosystem including Hive and Pig.

*Spark provides spark-shell, a command-line tool that can be used for quick prototyping.

I have created a data processing framework using Kafka, Spark, Flume and Hive to build a dashboard that displays the health of Linux servers in a data center. Kafka is used as the message collector; Flume acts as the producer, collecting health stats from the servers; Spark is the processing platform; and Hive is the data store. For the Kafka cluster, I set up a multi-broker environment with 3 brokers handling 1 topic with 3 partitions. For Apache Spark, I set up Spark on a 20-node Hadoop cluster.
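As a rough illustration of what the processing stage in such a pipeline computes, the sketch below groups health events into small time buckets and averages CPU usage per host within each bucket. It is plain Python that imitates the idea of micro-batch processing; it is not the Spark API and not the actual job used here. The event fields (`ts` in seconds, `host`, `cpu`), the 10-second interval, and the function names are all assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import mean

BATCH_SECONDS = 10  # assumption: hypothetical micro-batch interval


def micro_batches(events, batch_seconds=BATCH_SECONDS):
    """Group events into fixed-size time buckets keyed by bucket start time."""
    batches = defaultdict(list)
    for event in events:
        start = (event["ts"] // batch_seconds) * batch_seconds
        batches[start].append(event)
    # Return the buckets ordered by start time.
    return dict(sorted(batches.items()))


def average_cpu_per_host(batch):
    """Compute the mean CPU reading for each host within one batch."""
    per_host = defaultdict(list)
    for event in batch:
        per_host[event["host"]].append(event["cpu"])
    return {host: mean(values) for host, values in per_host.items()}


# Hypothetical health stats, as a collector might deliver them.
sample_events = [
    {"ts": 1, "host": "web-01", "cpu": 20.0},
    {"ts": 4, "host": "web-01", "cpu": 40.0},
    {"ts": 3, "host": "db-01", "cpu": 10.0},
    {"ts": 12, "host": "web-01", "cpu": 90.0},
]

# One summary per batch; a dashboard would be refreshed from these.
summaries = {
    start: average_cpu_per_host(batch)
    for start, batch in micro_batches(sample_events).items()
}
```

Each batch yields one small summary, so the dashboard can be refreshed once per interval rather than once per hourly ETL run.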


©2016 by Swatcloud