September 2, 2015

Big Data - Myth Busting Service

I am sure many of you have heard of the 3Vs that characterize “Big Data”. Every other site seems to explain them these days, ever since Big Blue handed them out, so I won’t elaborate. I am more interested in the “C” that accompanies the Vs: the C of Confusion, created by the plethora of options that come with Big Data.

One common confusion has its roots in Big Data marketing. Many customers approach Big Data with speed in mind, and they find it hard to stomach the fact that batch-based systems (like Hadoop and Hive) are not really intended for interactive reporting. Not many realize that the Vs describe the data itself, not the “processing” of that data. Nonetheless, by selecting the right tool with the right options, we can always make a successful Big Data story!

In this article, I make the case for a business service that provides technology clarity and helps enterprises make optimal Big Data technology decisions. To illustrate the point and bring out its relevance, I use two examples: a failed Big Data solution and the problem of “Reporting on Big Data”.

A Botched Up Big Data Solution

Consider this case:

The customer’s requirement was to ingest data from devices and information systems installed at their clients’ premises, develop analytic and reporting solutions, and deliver them through the cloud.

The customer approached us very recently after a failed Big Data PoC attempt with one of their vendors. We reviewed the architecture and found an inappropriate use of NoSQL technology. The data was completely structured, although different clients used different data formats, and the customer wanted to be able to add new columns later without affecting reporting on older data.

This is not hard to do. Data formats can always be standardized through guided pre-processing, and new columns can be added to existing tables without resorting to NoSQL. With Hive 0.15.0 (released as Hive 1.1.0), this is available for storage formats such as Avro as well, which paves the way for an efficient implementation. Besides, NoSQL stores were not designed with reporting in mind.
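The two ideas above (standardizing client formats in a pre-processing step, then adding a column without disturbing older data) can be sketched in a few lines. This is only an illustration under loud assumptions: it uses Python's built-in `sqlite3` as a stand-in for a Hive table, and the client names, field mappings, and the `readings` table are all hypothetical, not taken from the customer's system.

```python
import sqlite3

# Hypothetical per-client field mappings: client field name -> standard column.
CLIENT_FIELD_MAPS = {
    "client_a": {"dev": "device_id", "temp_c": "temperature"},
    "client_b": {"deviceId": "device_id", "temperature": "temperature"},
}

def standardize(client, record):
    """Rename one client's fields to the standard column names."""
    mapping = CLIENT_FIELD_MAPS[client]
    return {mapping[key]: value for key, value in record.items() if key in mapping}

# sqlite3 is only a stand-in here; a real deployment would target Hive tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device_id TEXT, temperature REAL)")

raw = [
    ("client_a", {"dev": "a-1", "temp_c": 21.5}),
    ("client_b", {"deviceId": "b-7", "temperature": 19.0}),
]
for client, record in raw:
    row = standardize(client, record)
    conn.execute(
        "INSERT INTO readings (device_id, temperature) VALUES (?, ?)",
        (row["device_id"], row["temperature"]),
    )

# A new column arrives later; existing rows are untouched and read back as NULL.
conn.execute("ALTER TABLE readings ADD COLUMN humidity REAL")
conn.execute("INSERT INTO readings VALUES (?, ?, ?)", ("a-1", 22.0, 40.0))

# A report written before the schema change still runs unchanged.
old_report = conn.execute(
    "SELECT device_id, AVG(temperature) FROM readings "
    "GROUP BY device_id ORDER BY device_id"
).fetchall()
all_rows = conn.execute(
    "SELECT device_id, temperature, humidity FROM readings ORDER BY rowid"
).fetchall()
```

The older rows simply report no value for `humidity`, while the report that only knows about `temperature` keeps working, which is the behaviour the customer was asking for.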

Why did it go wrong?

If this was so easy, why did it go wrong? Simply because of the staggering number of options that the Big Data ecosystem throws at you. It is hardly possible for one person to figure out the pluses and minuses of every tool and use them all judiciously. Add to this the fact that the tools keep evolving. The role of a Big Data architect is pretty demanding, and not many can stand that pressure.

Reporting – Problem of Plenty

Let us now look at reporting and the problem of plenty that accompanies it. Reporting is done through BI tools, and SQL support (JDBC/ODBC) is fundamental to it. Here is a look at the different options for reporting on Big Data.

Hive, the oldest SQL-on-Hadoop solution, connects to BI tools but is known to be slow, because it relies on MapReduce (M/R) as its execution engine.

Many SQL-on-Hadoop alternatives are mushrooming in the Big Data ecosystem to handle these interactive reporting requirements, e.g. Apache Tez, Spark SQL, the Spark execution engine for Hive, and Cloudera Impala. For semi-structured data, Apache Drill (backed by MapR) makes a good case, e.g. for exploring JSON data.

Be it Hive, Impala, or Spark, all of these “compute” every time they are invoked. It therefore makes sense to cache frequently executed queries, or to pre-compute certain queries, and serve the results through SQL. That is (approximately) what the Apache Kylin project tries to do. This new arrival in the Apache Incubator is an extreme OLAP computation engine for Apache Hadoop that uses pre-computed data cubes to serve SQL queries.
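To make the pre-computation idea concrete, here is a toy sketch of a data cube in plain Python. It is not Kylin's implementation or API; the dimensions, the fact rows, and the `build_cube` and `query` helpers are hypothetical names chosen for illustration. The point is only that totals for every combination of dimensions are computed once, so a later query becomes a lookup instead of a scan.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("region", "product")

# Hypothetical fact rows: (region, product, sales).
FACTS = [
    ("north", "widget", 10),
    ("north", "gadget", 5),
    ("south", "widget", 7),
    ("south", "widget", 3),
]

def build_cube(facts):
    """Pre-aggregate SUM(sales) for every subset of the dimensions."""
    cube = defaultdict(int)
    for region, product, sales in facts:
        values = {"region": region, "product": product}
        for size in range(len(DIMENSIONS) + 1):
            for dims in combinations(DIMENSIONS, size):
                key = (dims, tuple(values[d] for d in dims))
                cube[key] += sales
    return dict(cube)

def query(cube, **filters):
    """Answer SUM(sales) for equality filters with a single dictionary lookup."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    return cube.get((dims, tuple(filters[d] for d in dims)), 0)

cube = build_cube(FACTS)        # the expensive, one-off "cube build" phase
total = query(cube)             # grand total, no row scan
north = query(cube, region="north")
south_widget = query(cube, region="south", product="widget")
```

The trade-off is visible even at this scale: the cube has to be designed and built before any query benefits, and its size grows with the number of dimension combinations.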

Each of these options has its own pluses and minuses. Hive does not make a case for interactive reporting: it is slow but proven. Hive supports complex data types (e.g. map, struct) and can process semi-structured data (e.g. hierarchical data like JSON).

Impala is quite fast but lacks support for complex data types. Impala is backed by Cloudera, so choosing it effectively pins down your Hadoop distribution as well.

Spark SQL is still evolving. We need to wait and see how it proves itself.

Apache Kylin is very new and needs a cube design and computation phase before its speedups are realized. Kylin can be useful when data grows to several hundred terabytes, a size at which row scans and partial row scans can prove very costly.

HBase and MongoDB are not designed for connecting to BI tools or for reporting. They simply power OLAP/OLTP-type applications. However, connectors are being developed to enable reporting on top of them, e.g. Apache Phoenix for HBase, and MongoDB is working on becoming more reporting-friendly.

The Case in Point

So what’s the point, you may wonder?

The point is that a simple case of “reporting” alone elicits so many options from the Big Data world. We have not even discussed the different Big Data distributions, Hadoop on cloud, Hadoop as a service, data processing tools, storage formats, compression options, scheduling, NoSQL, graph databases, real-time streaming, analytics, other applications, performance, scale, and interoperability. What is the right combination for your application? Some enterprises want to experiment and find out. In fact, at HCL, we helped one of our customers by preparing a multi-terabyte Hadoop cluster on the HCL internal cloud, experimenting with a slew of option combinations (caching, compression, storage format, data size, etc.) and reporting the best-performing options for the range of queries the customer was interested in.

Just imagine the following:

  1. Imagine a service that takes your architecture blueprint, sets up the data pipeline in a public cloud with different distributions and tools, simulates data artificially, and tells you which combination will rock your queries and apps, with strong numbers and proof!
  2. Imagine a service that allows you to compare Cloudera Impala backed by the Parquet format and the Snappy codec with Spark SQL, specifically for your use cases. The resulting knowledge can translate into quicker time to market, better customer experience, or faster reaction times!
  3. Imagine a service that alerts you to a feature recently added to a tool (say Apache Drill) that makes it a better choice than your current reporting solution!
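The core of such a service is a harness that tries every combination of options and ranks them by measured time. Here is a rough sketch of that loop, under stated assumptions: the option values are just labels, `benchmark` and `all_combinations` are hypothetical helpers, and `fake_run_query` is a placeholder that does no real work. A real harness would replace it with calls through the JDBC/ODBC driver of each engine under test.

```python
import itertools
import time

# Hypothetical option grid; the values are labels, not driver settings.
OPTIONS = {
    "engine": ["impala", "spark-sql"],
    "storage_format": ["parquet", "avro"],
    "codec": ["snappy", "gzip"],
}

def all_combinations(options):
    """Yield one dict per combination of option values."""
    keys = list(options)
    for values in itertools.product(*(options[k] for k in keys)):
        yield dict(zip(keys, values))

def benchmark(run_query, queries, options, repeats=3):
    """Time the whole query set per combination; keep the best of `repeats` runs."""
    results = []
    for combo in all_combinations(options):
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            for q in queries:
                run_query(q, combo)
            timings.append(time.perf_counter() - start)
        results.append((min(timings), combo))
    # Sort on elapsed time only, so ties never try to compare dicts.
    return sorted(results, key=lambda r: r[0])

def fake_run_query(query, combo):
    """Placeholder for a real driver call against the configured engine."""
    return None

ranking = benchmark(fake_run_query, ["SELECT 1"], OPTIONS)
best_seconds, best_combo = ranking[0]
```

With the placeholder runner the timings are meaningless; the sketch only shows the shape of the experiment: enumerate, run, measure, rank.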

If you have read this far, let me ask you: would you be interested in a Big Data Myth Busting service like that? What would you like to see in such a service? Please leave your thoughts below. Thanks!

References

  1. Apache Hive - http://hive.apache.org/
  2. Apache Tez - http://tez.apache.org/
  3. Spark SQL - https://spark.apache.org/sql/
  4. Spark Execution Engine for Hive - https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
  5. Cloudera Impala - http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html
  6. Apache Drill - http://drill.apache.org/
  7. Apache Kylin - http://kylin.incubator.apache.org/index.html
  8. Apache Phoenix - http://phoenix.apache.org/
  9. MongoDB BI connectivity - https://www.mongodb.com/press/opens-modern-application-data-to-new-generation-visual-analysis-and-traditional-bi-tools