January 30, 2015


Apache Storm in HDInsight

HDInsight: What is Storm? How do you use Storm in HDInsight? Why choose Storm over Spark streaming?

We can use MapReduce, Hive, Pig, HBase, etc. in Hadoop (and therefore in HDInsight) to process and analyze Big Data, but all of these components process data in batches, which makes them inefficient for real-time data. Batch processing can take a long time (sometimes hours) to work through huge amounts of stored data. To overcome this, we need real-time data processing.

Today, people and smart devices constantly exchange data and events with each other. For example, we can now control a home security system from a phone, along with devices such as the AC, TV, microwave, refrigerator, health monitoring devices, sensors, watches, cars, and more.

Imagine millions of devices working worldwide and generating petabytes of data (or events) each minute, a figure that could double or triple in the coming years. Batch processing would struggle to analyze such huge amounts of data in a timely way. It is far more practical to process and analyze the data in real time, as it arrives. Apache Storm is a solution for processing real-time event data.

What is Storm?

Apache Storm is a free and open source distributed real-time computation system. It is fast and reliable enough to process millions of tuples per second on live streams. Storm can be used with any programming language, including those commonly used for machine learning.

Recently, Microsoft unleashed Apache Storm on its Analytics cloud with fully managed Hadoop services. It brings near real-time analytics capabilities to HDInsight. The image below describes an end-to-end applicability of HDInsight Storm.

How to use Storm in HDInsight

A Storm cluster in HDInsight can be provisioned with just a few clicks. Follow the steps below:

Go to Azure Management Portal → Choose HDInsight under Data Services → Choose Storm → Provide cluster details → Submit

Storm differs from Hadoop's batch processing: it uses topologies instead of MapReduce jobs. A Storm cluster contains two types of nodes: head nodes and worker nodes. Storm's core interfaces are defined in Thrift, and Thrift can be used from any language, so Storm is language independent. On HDInsight, however, only three options are supported: .NET, Python, and Java.
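To make the idea of a topology more concrete, here is a minimal sketch in plain Python. It does not use the Storm or SCP API; every function name in it is hypothetical. In Storm's terminology, a topology is a graph of spouts (stream sources) and bolts (processing steps) through which tuples flow. The sketch chains one spout and two bolts into a simple word count.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, List, Tuple

# Illustration only: these names are hypothetical and are not Storm/SCP APIs.

def sentence_spout() -> Iterator[Tuple[str]]:
    """Spout: the source of the stream.
    A real spout would read from a live source and never finish."""
    for sentence in ["storm processes streams", "storm runs topologies"]:
        yield (sentence,)

def split_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt: emit one tuple per word in each incoming sentence."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str, int]]:
    """Bolt: keep a running count and emit the updated count for each word."""
    counts: Counter = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])

def run_topology(spout: Callable[[], Iterable], bolts: List[Callable]) -> list:
    """Wire the spout to the bolts in order and drain the resulting stream."""
    stream = spout()
    for bolt in bolts:
        stream = bolt(stream)
    return list(stream)

results = run_topology(sentence_spout, [split_bolt, count_bolt])
for word, count in results:
    print(word, count)
```

Each tuple is handled as soon as the spout emits it, and the count is updated immediately instead of waiting for a full batch. A real Storm cluster distributes the spouts and bolts across its worker nodes; this single-process sketch only shows the data flow.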

An SDK is available to help you develop applications with Storm. The SDK, SCP (Stream Computing Platform), can be downloaded from the Storm cluster, from the %storm_home%\examples\SDK folder. Install it and you will find the Microsoft.SCP namespace in your Visual Studio project. Put your stream processing logic in the project, build it to generate an executable (.exe) file, and upload the build output to the Storm cluster (e.g. to the examples folder).

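In the next step, you will need to create a .spec file. This is a structured text file that describes the topology; its format is SCP's topology specification language, which is based on Storm's Clojure DSL.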

Now connect to your Storm cluster over Remote Desktop (RDP) and run the following in the Storm command line:

bin\runspec examples\Storm.spec temp examples\MyApp

Note: A Storm topology runs until you stop it. To stop (kill) it, open the Storm UI for your cluster, select the topology, and kill it from the topology actions.

Why choose Storm over Spark streaming?

Here’s a quick comparison, with reasons to use Storm over Spark streaming on the HDInsight platform.

  • Storm is a distributed real-time computation system and is language independent, whereas Spark is an in-memory distributed data analysis platform that is mainly used to speed up analytics jobs or queries. However, like Storm topologies, Spark streaming jobs run until shut down by the user.
  • Storm and Spark can both run independently and have no dependency on Hadoop. However, Spark performs batch processing (micro-batch processing in streaming), while Storm processes each event as it arrives.
  • Microsoft is working with Hortonworks on Storm to optimize it for the HDInsight platform. Spark has a larger volume of open issues than Storm.
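The difference between micro-batching and per-event processing in the second point can be illustrated with a small sketch in plain Python. It uses neither the Storm nor the Spark API; the function names, the event data, and the one-second batch interval are all assumptions chosen for illustration, and processing time is ignored. It compares when a result becomes available under each model.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload)

def per_event(events: List[Event]) -> List[Event]:
    """Per-event model: each event is handled as soon as it arrives,
    so its result is available at its arrival time."""
    return [(t, payload.upper()) for t, payload in events]

def micro_batch(events: List[Event], interval: float) -> List[Event]:
    """Micro-batch model: events are grouped into fixed windows, and a
    window's results only become available when the window closes."""
    out = []
    for t, payload in events:
        window_end = (int(t // interval) + 1) * interval
        out.append((window_end, payload.upper()))
    return out

# Hypothetical sample events.
events = [(0.1, "a"), (0.4, "b"), (1.2, "c")]

immediate = per_event(events)
batched = micro_batch(events, interval=1.0)

# Extra waiting time introduced by batching, per event.
added_delay = [round(b[0] - e[0], 1) for b, e in zip(batched, events)]
print(added_delay)
```

With a one-second interval, every event waits for its window to close before its result appears, while the per-event model has no such wait. A shorter interval reduces the delay but does not remove it.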