Historically, metadata has been captured, stored and utilized by enterprises in a variety of ways, including enterprise architecture, information taxonomy, and data asset management. Metadata is ubiquitous and available across the enterprise in applications (COTS), enterprise architecture, databases, data modeling, application integration (inbound and outbound), EDI, ETL, reporting, and metadata capture tools.
Enterprise architects and data governance specialists have long proposed capturing metadata and using it effectively for enterprise benefit. However, the exercise slowly but steadily loses steam, and metadata capture ultimately becomes one of the checkboxes in architecture and data governance reviews. Not much thought has been given to how metadata can be used effectively and how it can further the data-first principle of making the ‘right data available consistently, at a speed at which the business can benefit from its use’.
In the following paragraphs we cover how AI-driven metadata can permeate the data management ecosystem as a whole.
The metadata captured from systems of record, systems of integration, and systems of innovation can be classified as application, integration, data quality, and usage metadata. Enterprises are typically in a state of constant change: applications are continually gaining new features, being modified, or being modernized. While effort is made to keep all stakeholders in the loop, it can prove difficult from a release management perspective to ensure that the changes are applied across the enterprise. As a result, changes may not be reflected in the systems of record, integration, or innovation, which breaks the established process and causes issues such as ETL load failures and failed report refreshes.
It is for this reason that metadata needs to be kept current, by applying to it the same change data capture approach used in ETL scenarios. Changes to metadata can take several forms: changes to the definition of an existing attribute, entity, or list of values; new attributes added to an entity; a feature being enabled or modified in an application; process changes in ETL, APIs, microservices, or orchestration; and changes to business rules in applications, data quality, or monitoring.
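As a rough illustration of change data capture applied to metadata, the sketch below compares two metadata snapshots and lists what was added, removed, or modified. The snapshot layout, the `AttributeMeta` class, and the `diff_metadata` function are hypothetical names chosen for this example, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeMeta:
    """Hypothetical metadata record for one attribute of an entity."""
    data_type: str
    definition: str


def diff_metadata(previous, current):
    """Compare two snapshots shaped {entity: {attribute: AttributeMeta}}.

    Returns a list of (change_kind, entity, attribute) tuples.
    """
    changes = []
    for entity in sorted(set(previous) | set(current)):
        old_attrs = previous.get(entity, {})
        new_attrs = current.get(entity, {})
        for attr in sorted(set(old_attrs) | set(new_attrs)):
            old, new = old_attrs.get(attr), new_attrs.get(attr)
            if old is None:
                changes.append(("added", entity, attr))
            elif new is None:
                changes.append(("removed", entity, attr))
            elif old != new:
                # Data type or definition differs between snapshots.
                changes.append(("modified", entity, attr))
    return changes


# Illustrative snapshots (made-up entity and attributes).
previous = {"customer": {
    "id": AttributeMeta("int", "Customer key"),
    "name": AttributeMeta("varchar(50)", "Full name"),
}}
current = {"customer": {
    "id": AttributeMeta("int", "Customer key"),
    "name": AttributeMeta("varchar(100)", "Full name"),
    "email": AttributeMeta("varchar(255)", "Contact email"),
}}

changes = diff_metadata(previous, current)
for change in changes:
    print(change)
```

In practice the snapshots would come from whichever metadata capture tools the enterprise already runs; the comparison itself stays the same.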
There is merit in keeping metadata current so that any changes upstream or downstream will be captured and proactive actions can be undertaken along the lineage of data flow within the enterprise.
One of the many purposes of capturing metadata is to identify data lineage from the systems of record to the systems of innovation. Along the way, this creates data catalogs, which are essentially logical sets of data persisting within the data management ecosystem.
Combined with a machine learning algorithm, metadata allows proactive actions to be initiated as soon as changes occur in the lineage. Using techniques like schema validation, changes to data structures can be identified. Since data lineage is captured, the impact of a change to a data structure can be traced along the lineage from systems of record to integration to innovation.
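Impact analysis along the lineage can be pictured as a graph traversal. The following is a minimal sketch, assuming lineage is stored as a simple mapping from each asset to its direct downstream consumers; the asset names and the `downstream_impact` function are invented for illustration.

```python
from collections import deque


def downstream_impact(lineage, changed):
    """Return every asset reachable downstream of `changed`, breadth-first.

    `lineage` maps an asset name to the list of assets that consume it.
    """
    seen = {changed}
    queue = deque([changed])
    impacted = []
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


# Hypothetical lineage: system of record -> ETL job -> raw layer -> consumers.
lineage = {
    "crm.customer": ["etl.load_customer"],
    "etl.load_customer": ["raw.customer"],
    "raw.customer": ["report.churn", "api.customer_profile"],
}

impacted = downstream_impact(lineage, "crm.customer")
print(impacted)
```

A breadth-first walk is used here so that assets closest to the change are listed first, which is a convenient order for notifying owners or scheduling updates.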
The machine learning algorithm can use the learnings from schema validation and, based on lineage details (data structures, ETL jobs, reporting, analytics, and APIs), ensure that all relevant jobs and schedulers are updated with the changes, as well as the raw data schemas (via ALTER TABLE statements). This ensures that data lineage from systems of record to systems of integration (raw data layer) is always current. The process is workflow-enabled and enforces enterprise governance policies.
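The raw-schema update step might look something like the sketch below, which turns a detected column difference into proposed ALTER TABLE statements. It is only an illustration: the `alter_statements` function and table name are hypothetical, the `ALTER COLUMN ... TYPE` form follows PostgreSQL syntax and differs in other databases, and the statements are returned rather than executed so that they can pass through a governance workflow first.

```python
def alter_statements(table, old_columns, new_columns):
    """Propose ALTER TABLE statements that bring a raw table up to date.

    Both column arguments map column name -> SQL type. Columns that vanish
    upstream are intentionally left in place (an assumption of this sketch),
    so that historical raw data is not dropped automatically.
    """
    statements = []
    for name, col_type in new_columns.items():
        if name not in old_columns:
            statements.append(
                f"ALTER TABLE {table} ADD COLUMN {name} {col_type};"
            )
        elif old_columns[name] != col_type:
            # PostgreSQL-style syntax; adjust for the target database.
            statements.append(
                f"ALTER TABLE {table} ALTER COLUMN {name} TYPE {col_type};"
            )
    return statements


# Illustrative schemas (made-up table and columns).
old = {"id": "INTEGER", "name": "VARCHAR(50)"}
new = {"id": "INTEGER", "name": "VARCHAR(100)", "email": "VARCHAR(255)"}

proposed = alter_statements("raw.customer", old, new)
for statement in proposed:
    print(statement)
```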
It is critical that all relevant upstream (systems of record) changes, subject to data governance across the enterprise, land in a raw data layer so that the process from data generation to the landing zone is foolproof. Once the raw data layer is updated, based on use cases, the data can be transformed and published to downstream applications, reporting solutions, analytics, and APIs following the established process. Access to the raw data can also be made available to data scientists, operational reporting, etc. in a pub-sub model.
A case in point is a large bank in Europe with region-specific data lakes that follow regional rules. When consolidation happens at the bank level, data management issues arise from differences in the data structures used, in the aggregation rules associated with critical data attributes (based on their definitions), and so on.
The solution provided was the creation of a metadata data lake (MDL). The MDL sourced all the metadata from systems of record, integration, and innovation. The purpose of creating the MDL was to ensure that metadata information for all the regions is housed in a single repository and the repository is kept current as explained above. The machine learning algorithm sources the changes from the MDL and performs the necessary updates from systems of record to integration raw layer in an automated manner. The financial institution’s goal is to have two data lakes at the end of the transformation journey, one being the MDL and the second being a global consolidated data lake powered by metadata and machine learning.
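One thing a single metadata repository makes straightforward is spotting where regions disagree on the structure of the same attribute. The sketch below is a hypothetical illustration of that idea, not the bank's actual implementation; the region names, attributes, and the `find_conflicts` function are all made up.

```python
def find_conflicts(regional_metadata):
    """Find attributes whose data type differs between regions.

    `regional_metadata` maps region -> {attribute: data_type}.
    Returns {attribute: {region: data_type}} for conflicting attributes only.
    """
    by_attribute = {}
    for region, attrs in regional_metadata.items():
        for attr, dtype in attrs.items():
            by_attribute.setdefault(attr, {})[region] = dtype
    return {
        attr: regions
        for attr, regions in by_attribute.items()
        if len(set(regions.values())) > 1
    }


# Illustrative regional metadata (hypothetical regions and types).
regional = {
    "emea": {"balance": "DECIMAL(18,2)", "currency": "CHAR(3)"},
    "apac": {"balance": "DECIMAL(20,4)", "currency": "CHAR(3)"},
}

conflicts = find_conflicts(regional)
print(conflicts)
```

The same grouping could be extended to compare definitions or aggregation rules rather than data types alone.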
Another case in point is a quick service restaurant (QSR) undergoing a massive transformation to increase revenue by introducing short-term ideas like curbside delivery, self-service delivery, and catering orders through newer channels like web, mobile app, Alexa, and Grubhub. The enterprise landscape is changing, with changes to the e-commerce platform, a menu management application, billing applications, the implementation of an MDM solution (product and customer), etc. One can imagine that with so many moving parts the data management ecosystem will need to be re-architected.
However, while the re-architecture of the application and data landscape is under way, the QSR has to continue its business operations through its 1000+ stores. The solution provided included the creation of a metadata repository that is updated regularly based on changes occurring in the application landscape that impact the data landscape. The machine learning algorithm identifies the changes in the applications and how they impact operational reporting; the necessary changes are then made to the raw data layer and the operational reporting metrics scheme so that the reports are refreshed at the scheduled intervals.