October 28, 2014


Big Data takes on Enterprise Reporting

Last week I had the chance to attend Strata’s Big Data conference in New York City. This is one of the marquee events on the subject of Big Data, possibly the largest of its kind. I observed a few clear trends at the conference, trends that will influence both vendors and adopters of Big Data in distinct ways.

First of all, Big Data is now a true commercial reality. This year’s conference made it clear through simple things like the type of venue (no longer a hotel but a sold-out convention center), the number and scale of the vendors present, and most importantly, the audience. A few years ago, when this conference was gaining traction, the audience consisted of developers in t-shirts, technology geeks and a few members of the press looking to write about a new technology trend. Today, along with all these usual suspects, the suits are in. Large businesses are not only investing but buying. Big Data is real and here to stay.

Another key observation is that the world of Big Data (from small vendors to very large ones such as Microsoft, IBM and Google) has shifted its focus toward becoming a viable option for the information space currently ruled by relational databases: enterprise reporting.

In the past few years, businesses have deployed solutions to address the 3Vs of Big Data: Volume - mostly as a TCO play (EDW offloading); Velocity - primarily by deploying MPP appliances such as SAP HANA, Oracle Exadata or IBM Netezza; and Variety - by exposing new data types (clickstream, sensor, logs, social) to data scientists for analytics and machine learning.

While the application of Big Data around the 3Vs will continue, there are key reasons for Big Data vendors to try to penetrate the enterprise reporting space.

The first reason is financial: more than 70 percent of information spending is still tied to traditional RDBMS applications (EDWs, MDM, Data Marts, etc.) supporting enterprise reporting.

The second reason is functionality: the majority of the use cases that businesses enable through BI and Analytics still rely on traditional (structured) data.

The third reason is more technical. In traditional data warehousing, the business logic is applied either on the read (SQL) or on the write (ETL/data modeling). In Big Data, the logic tends to be applied primarily on the read. The advantage of Big Data applications such as a Big Data Lake or Reservoir is that all data is collected and retained in its granular form. In a data warehouse, by contrast, the data is summarized and the most granular layers are eventually disposed of. Therefore, when business users have new requirements or change existing ones, traditional data warehouses struggle unless the needed data is already in the summarized layers. Logic applied on the read cannot easily be changed to compensate, so new ETL and data modeling are required, making development cycles long and cumbersome.
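To make the read-versus-write distinction concrete, here is a minimal Python sketch. The event records, field names and both functions are hypothetical and chosen only for illustration; they do not represent any vendor's product. It contrasts a summary built at write time, which keeps only what the original requirement asked for, with an aggregation applied at read time over retained granular data.

```python
from collections import defaultdict

# Hypothetical granular events, as a data lake would retain them.
RAW_EVENTS = [
    {"day": "2014-10-20", "region": "East", "channel": "web", "amount": 120.0},
    {"day": "2014-10-20", "region": "West", "channel": "store", "amount": 80.0},
    {"day": "2014-10-21", "region": "East", "channel": "store", "amount": 50.0},
    {"day": "2014-10-21", "region": "East", "channel": "web", "amount": 30.0},
]


def summarize_on_write(events):
    """Warehouse-style ETL: aggregate revenue per region and keep only that."""
    totals = defaultdict(float)
    for event in events:
        totals[event["region"]] += event["amount"]
    return dict(totals)


def aggregate_on_read(events, key):
    """Lake-style query: group the retained detail by any field at read time."""
    totals = defaultdict(float)
    for event in events:
        totals[event[key]] += event["amount"]
    return dict(totals)


# The summary answers the original question (revenue by region)...
SUMMARY = summarize_on_write(RAW_EVENTS)

# ...but a new requirement (revenue by channel) can only be answered
# from the granular events, because the summary no longer carries "channel".
BY_CHANNEL = aggregate_on_read(RAW_EVENTS, "channel")

if __name__ == "__main__":
    print("Summary built on write:", SUMMARY)
    print("New question answered on read:", BY_CHANNEL)
```

In this sketch, answering the new question from the warehouse side would mean rewriting `summarize_on_write` and reloading the data, which is the longer development cycle described above.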

A Big Data Lake provides the most value when it can ingest, manage and retain all the data, and this is why vendors are eager to retain not just a portion of the traditional RDBMS data but all of it.

These three reasons make the growth of Big Data contingent on the emergence of a Data Fabric encompassing traditional data and new data types. However, creating a Data Fabric able to support enterprise reporting is not an easy task, and this is why we now witness furious development and aggressive positioning of functions and applications essential to enterprise information.

The challenge with these applications is maturity. Many are nascent and the technology is unproven, particularly for the enterprise space, which is very mature and where stakeholders are not keen to experiment with their most sensitive data. Additionally, even in scenarios where the technology can be proven suitable through POCs and POTs, users are typically left with another key challenge - ease of adoption.

What made applications such as Teradata so successful in the past fifteen years is the degree of out-of-the-box integration that the product offers. The commercialization of IT has made buyers increasingly demanding about ease of use and ease of adoption. Software is consumed more and more as a service, and clients have limited appetite for spending time putting all the components together. This is where System Integrators such as HCL can help. HCL’s Big Data COE invests a tremendous amount of effort in packaging multiple technologies into integrated Frameworks that provide end-to-end capabilities in key information management functions and a workable Data Fabric.

An example is HCL’s Data Movement Framework, which combines functions like traditional ETL with ELT, data synchronization for real-time ingestion and technologies such as YARN, Spark and Sqoop into a seamless solution that can ingest both structured and unstructured data. Another area, which is key for enterprise information, is Data Quality and Data Lineage. There are several key offerings in this space from vendors such as ASG Rochade, Informatica, Waterline Data Science or RedPoint. HCL developed a framework called MetaM to integrate these new metadata-based functionalities into an easy-to-adopt solution. On the MDM and Data Quality front, HCL collaborates with NoSQL vendors such as MongoDB to achieve a single view of the customer across structured and unstructured data.

While this is good news for most, it is important to note that not all areas of functionality may be ready for prime time or meet all enterprise reporting needs. Data Security is still an uphill battle, even though technologies such as Accumulo and Sqrrl and vendors such as Protegrity are making big inroads. System administration for Hadoop is another evolving area, where emerging vendors such as Pepperdata are bringing noteworthy solutions to market.

Finally, while technology can be game changing, there are certainly other aspects of integration that will determine ultimate success or failure. As the worlds of Big Data and enterprise reporting progressively converge, organizations will need to determine the optimal sequence of adoption - which components and functionality should co-exist versus which components can be integrated from the get-go. Operating models will need to adjust as machine learning and traditional BI, data scientists and business analysts, data engineers and EDW architects start to share the same canvas. Investment models and IT roadmaps will evolve as investments in legacy are progressively replaced by investments in innovation and traditional information management develops into true data as a service.