Successful Big Data implementation requires considerable resources and effort. According to Gartner, organizations are not confident in their Big Data investments. The reasons include a lack of experimentation with data initiatives and experiments that rely on ad-hoc technologies.
Decision paralysis over “what to do with the data?” is a major reason for the lack of experimentation with data initiatives. It is like waiting on the sidelines while your competition makes the most of its data. A bigger problem arises when you know what to do with the data but do not know which tools are available or which framework fits your use case.
There is no shortage of data processing frameworks, and open source execution engines are no exception. Major organizations have migrated, or are in the process of migrating, their existing data processing from OLAP systems to more sophisticated execution frameworks. This migration helps them extract crucial insights and maintain a competitive edge. If you haven’t decided yet, or if you are not confident in your chosen framework, this blog aims to give you the information you need to make the right selection.
Why are we discussing open source execution engines? Because an execution engine is the basic building block your platform uses to crunch the massive amount of data that is available. Moreover, the real advantage of Big Data lies in its ability to reduce costs and increase flexibility. Hence, it is no surprise that almost all experimentation starts with open source, and most organizations prefer to try first rather than buy upfront.
Now let’s look at the factors to keep in mind while selecting an execution engine:
- Use case
First, assess whether the data will feed canned reports that run on a regular schedule or whether it is meant for interactive ad-hoc analysis. For batch processing, Hive scales well with pre-defined queries that are not performance sensitive. For multiple data streams, Spark is a reasonable choice, while for interactive ad-hoc analysis, Presto and Hive with LLAP would be a good fit.
- Performance SLAs
Meeting SLAs can be a challenge if your data platform is not able to crunch the data and deliver results in time. A missed SLA can be catastrophic and could even lead to loss of business in certain cases. This makes it all the more important to assess whether you have made the right choice. Hive, due to its I/O-intensive processing, is not the right choice for obligations with stringent SLAs. Spark, with its in-memory capabilities, may help you meet the SLA, and Tez is another option.
- Storage options
Knowing your data platform and its available storage options can help you make the right decision. Consider a scenario where you have a Cloud data platform with its own prescribed storage option, but the data processing use case at hand requires an execution framework that is not compatible with the underlying storage. You will then have to look for alternatives, which costs additional time and effort.
- Evolution stage
Big Data frameworks are continuously evolving. Knowing their feature roadmap, current capabilities, and support options can help you avoid major roadblocks. After all, you may not want to commit to frameworks that are still in incubation and have no professional support.
It is evident that execution engines are meant for specific use cases and that no single option fits all. A deeper analysis of your primary and secondary use cases can help you create an ideal data processing platform.
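The use-case and SLA guidance above can be sketched as a small lookup. This is only an illustrative sketch: the `Workload` enum, the `suggest_engines` function, and the `strict_sla` flag are hypothetical names introduced here, and the mapping simply restates the rules of thumb from this post rather than any benchmark.

```python
from enum import Enum


class Workload(Enum):
    """Hypothetical workload categories, taken from the use cases above."""
    BATCH = "batch"              # canned reports on a regular schedule
    STREAMING = "streaming"      # multiple data streams
    INTERACTIVE = "interactive"  # interactive ad-hoc analysis


# Rule-of-thumb candidates per workload, as described in this post.
_CANDIDATES = {
    Workload.BATCH: ["Hive"],
    Workload.STREAMING: ["Spark"],
    Workload.INTERACTIVE: ["Presto", "Hive with LLAP"],
}


def suggest_engines(workload, strict_sla=False):
    """Return candidate engines for a workload (a starting point, not a verdict)."""
    candidates = list(_CANDIDATES[workload])
    if strict_sla and "Hive" in candidates:
        # Plain Hive is I/O intensive, so for stringent SLAs the post
        # points to Spark or Tez instead.
        candidates = [e for e in candidates if e != "Hive"]
        for engine in ("Spark", "Tez"):
            if engine not in candidates:
                candidates.append(engine)
    return candidates


if __name__ == "__main__":
    for workload in Workload:
        print(workload.value, suggest_engines(workload, strict_sla=True))
```

Storage compatibility and framework maturity are deliberately left out of the sketch, since they depend on your platform and have to be checked case by case.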
Apart from execution engines, many other factors can have an adverse impact on your Big Data investment. Having the right technology stack is only the first step. It needs to be backed by a clear understanding of the business needs, adequate planning for data growth, and a high level of maturity.
In this first blog of the series, we have tried to set the right context. Subsequent blogs are expected to cover other crucial aspects. If you are still deciding on the data platform that best suits your needs, or if you are not satisfied with your current one and are looking for an assessment, we can be your partner in this pursuit.
- Gartner Survey Reveals Investment in Big Data Is Up but Fewer Organizations Plan to Invest
- Apache Hive - https://hive.apache.org/
- Apache Spark - http://spark.apache.org/
- Presto - https://prestodb.io/
- Hive with LLAP - https://cwiki.apache.org/confluence/display/Hive/LLAP
- Apache Tez - http://tez.apache.org/