In the past few weeks, I have met with many of our customers, and I see a clear trend in the making with respect to data and analytics. Global enterprises now expect the following when it comes to data lifecycle management:
- Access to all data, both in real time and in batches
- Ability to link disparate data sets
- Ability to handle huge volumes of data
- Faster integrations
- Fast query response times
- Self-service capability
It’s quite a surprise when a customer tells you that they know their business well, have defined their KPIs, and know most of their data, but are unable to link that data meaningfully or scale up, owing to platform limitations, lack of awareness, and similar constraints.
There is considerable demand around how to bring all the data together (integrate), model it meaningfully (use case), deliver it quickly, and perform analytics effectively on top of it. Reading between the lines, I also see a requirement to efficiently add newer sources of data (internal, external, and social) into the mix.
A few aspects to be considered include the following:
- All data in: Any data, any format, and from any source can be ingested into the data lake using HCL’s DMF (Data Movement Framework).
- Use-case-driven models enable KPI metric analytics at the subject-area level as well as the enterprise level: Think of it as enterprise-level KPIs (conformity at the enterprise level) and local-level KPIs (at the subject-area level). The models need as much historical and current data as is available for better training, so that deep learning can more reliably predict the “Hall of Fame” and “Hall of Shame” outcomes (the best and worst performers against each KPI).
- Efficiently add newer data sources (scalability): At the platform level, scalability is critical to add newer sources of data arising out of organic and inorganic data growth. The concepts around source-independent layers become important and must be kept in mind when models are being designed.
- D4 Data strategies:
- Data models need to be designed to allow real-time and batch data loads at the same time without compromising the enterprise view of data. In this sense, it is critical to apply idempotency, data vault, and lambda-architecture principles effectively without compromising the efficacy of the data models.
- Storage strategies, including Hot, Cold, On-Prem, On-Cloud, and Hybrid: Predominantly, most organizations prefer cloud-based solutions; however, adequate thought must be given to PII, GDPR, local laws, etc., as these may influence data load and storage strategies in terms of encryption in motion and at rest.
- Access strategies, including fast data, slow data, and DaaS: While fast data models need high performance and accessibility with little enrichment, thought should be given to the associated costs; this is why slow data is kept on a durable storage tier at lower cost.
- The DaaS (Data as a Service) concept calls for a publish-subscribe model that makes data models available in multiple formats to consuming applications.
- Importance should be given to data lifecycle management, including archival/purging strategies, and to keeping data available for data science, deep learning, machine learning, etc. One key point to keep in mind is that for any such algorithm to be successful, the DATA is key. The data models and the ETL/ELT processes should enable the design and development of business-specific algorithms.
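To make the idempotency point above concrete, here is a minimal sketch (not any particular product's implementation) of an idempotent merge: records carry a business key and a load timestamp, so replaying a batch, or interleaving batch and real-time loads, leaves the target in the same state. The `Record` type and field names are illustrative assumptions.

```python
# Sketch of an idempotent merge step: records are keyed by a business key,
# and re-applying the same batch leaves the target unchanged.
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str       # business key (e.g. a customer ID) - illustrative
    load_ts: int   # load timestamp; a newer record wins
    payload: str


def idempotent_merge(target: dict, records: list) -> dict:
    """Upsert records into target; replaying the same batch is a no-op."""
    for rec in records:
        current = target.get(rec.key)
        # Apply only if strictly newer than what we already hold.
        if current is None or rec.load_ts > current.load_ts:
            target[rec.key] = rec
    return target


batch = [Record("C1", 1, "v1"), Record("C2", 1, "v1")]   # batch load
stream = [Record("C1", 2, "v2")]                          # real-time update

state = {}
idempotent_merge(state, batch)
idempotent_merge(state, stream)
idempotent_merge(state, batch)   # replayed batch: no effect
```

Because the merge compares load timestamps rather than blindly appending, the same principle supports a lambda-style design where batch and speed layers feed one serving view.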
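The DaaS point can be sketched in a few lines: one published data model, rendered in whatever format each consuming application subscribes to. The sample rows and the `render` function are assumptions for illustration only.

```python
# Sketch of the DaaS idea: the same data model served in multiple formats.
import csv
import io
import json

ROWS = [{"kpi": "revenue", "value": 120}, {"kpi": "churn", "value": 3}]


def render(rows: list, fmt: str) -> str:
    """Serve the same data model as JSON or CSV, per subscriber request."""
    if fmt == "json":
        return json.dumps(rows)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=["kpi", "value"])
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    raise ValueError(f"unsupported format: {fmt}")
```

In a real publish-subscribe setup the renderer would sit behind a topic or an API gateway; the point here is only that the model is defined once and the format is a delivery concern.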
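Finally, the lifecycle-management point can be illustrated with a small retention sweep: records past a hot-tier cutoff are moved to an archive tier rather than deleted, so history stays available for model training. The 90-day window and record shape are assumed values, not a recommendation.

```python
# Sketch of a retention sweep: old records move to an archive tier
# (not deleted), preserving history for analytics and ML training.
from datetime import date, timedelta


def tier_records(records: list, today: date, hot_days: int = 90):
    """Split records into a hot tier (recent) and an archive tier (old)."""
    cutoff = today - timedelta(days=hot_days)
    hot = [r for r in records if r["loaded"] >= cutoff]
    archive = [r for r in records if r["loaded"] < cutoff]
    return hot, archive


records = [
    {"id": 1, "loaded": date(2024, 1, 5)},   # old: goes to archive
    {"id": 2, "loaded": date(2024, 6, 1)},   # recent: stays hot
]
hot, archive = tier_records(records, today=date(2024, 6, 15))
```

The same split maps naturally onto hot/cold storage tiers: the hot list stays on the fast (and expensive) tier, while the archive list lands on durable, lower-cost storage.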