According to Crowdflower’s Data Science Report, 80% of the time spent on data analytics goes to data preparation and processing. Many processes are involved in preparing data for consumption through business intelligence reporting, data analytics, and data visualization, among others. How data is prepared often determines whether it is consistent and accurate for the people receiving reports or viewing dashboards. This guide covers the data management process end to end: identifying data sources, understanding data quality, wrangling, integrating, loading, and testing the data, applying data governance rules, verifying the results, and making the data ready for consumption through data analytics, data visualization, and business intelligence reporting.
For each process, we explore the leading tools and technologies that can be leveraged, as well as the HCL solution accelerators that improve productivity and shorten data lifecycle timelines.
This guide can help business leaders identify the tools and skills needed for their data management requirements, reducing reliance on IT. It also streamlines data processes, because everyone in the business will have the knowledge needed to interpret data and understand what decisions need to be made. After reading this guide, you will have a solid understanding of the data management process that will help you stay ahead of the competition.
According to a 2017 Gartner report, 60% of data analytics projects fail because they are not aligned with business strategy or lack the right talent. A data-literate staff saves time, energy, and money, which in turn helps advance the current state of your operations. 85% of organizations see data as one of the most valuable assets to their business.
Poorly managed data can have a significant impact, so it is important to choose the right tool for each data requirement (data process):
- Failing to report on and comply with strict financial and data-retention regulations can lead to penalties.
- The quality of the analytics used to deliver value to the business drops, which degrades the quality of business decisions.
- A lot of time can be lost in manual data processing and preparation, which means many opportunities to add business value are missed.
- Inconsistencies across subject areas and applications can sneak into reports, with potentially huge effects on the business.
|#||Task||Data Management Process (Terminology)||Leading Tools / Technologies||HCL Partnership||HCL Accelerators|
|1||Understanding data relationships (what, where, and how), with easy search||Data Discovery||Mentis (iDiscover), Denodo, Azure SQL DB||Alteryx, Mentis, Talend, Denodo, Microsoft Azure||iSee, MetaWisdom|
|2||Identifying DQ issues in the current system; detecting and correcting inaccurate data||Data Profiling; implementing data quality checks; developing balance control and validation||Informatica DQ, IBM InfoSphere QualityStage/Information Analyzer, Talend Data Quality, Microsoft Data Quality Services||Informatica, IBM, Talend, Microsoft||Advantage DQ|
|3||Data architectures that address data in storage, data in use, and data in motion; descriptions of data stores, data groups, and data items; creating a data model (Conceptual, Logical, Physical)||Data Architecture (Modeling)||CA Erwin Data Modeler, Idera ER/Studio, Oracle SQL Developer Data Modeler, IBM InfoSphere Data Architect||CA, IBM|
|4||Connecting to diverse data sources (Abstraction layer)||Data Integration (ETL), Data Preparation, Data Virtualization||Informatica, IBM, Talend & other ETL tools, Denodo, Azure Databricks, Alteryx, Collibra||Informatica, IBM, Talend, Denodo, Microsoft, Alteryx, Collibra||Sketch|
|5||Complex data processing (multiple joins, complex filters, derivations, PL/SQL stored procedures in transformations, data generation, capturing historical changes in data, batch/stream processing)||Data Integration (ETL)||Informatica, IBM DataStage, SAP BODS, Talend, Oracle Data Integrator, Azure Event Hub, Azure Data Factory||Informatica, IBM, Talend, Microsoft||Sketch, nervIO|
|6||Simple to medium-complexity data processing: filters to eliminate bad records, data aggregation, column derivations/arithmetic expressions, transformations (standardization, data type conversions, business-to-technical logic conversion)||Data Preparation; Data Virtualization||Alteryx, Trifacta Wrangler, Denodo, Microsoft Power View / Power Query, Talend||Alteryx, Denodo, Microsoft, Talend||Sketch, nervIO|
|7||Data consolidation and data migration onto the target platform (database, files, or web service; on premises)||Data Integration (ETL); Data Virtualization||Informatica, Talend, IBM, Denodo||Informatica, IBM, Talend, Denodo||Sketch|
|8||Data consolidation and data migration onto the target platform (on cloud)||Data Loading; Data Migration||Informatica, Talend, Microsoft Azure Data Factory, AWS Glue ETL||Informatica, IBM, Microsoft||Sketch|
|9||Integrating multiple siloed data sources into a unified layer; accessing data in real time; short development time on a smaller budget; data distribution via JDBC/ODBC/web services (REST/SOAP)||Data Virtualization||Denodo, Tibco DV||Denodo, Tibco|
|10||Storing data in structured and unstructured formats (on-premises/cloud platform)||Data Storage||Traditional: RDBMS such as Oracle, SQL Server, Teradata; Big Data: Hortonworks, Cloudera, MapR; File/BLOB Storage in Cloud: Azure Blobs/Azure Data Lake Storage, Amazon S3, Snowflake; Structured Storage in Cloud: Azure SQL DB, Azure SQL Data Warehouse, AWS RDS, EDW on Redshift, Amazon DynamoDB, Snowflake DB/Warehouse; NoSQL DB: MongoDB, Cosmos DB; Event Streaming Platform: Confluent; REST APIs: JSON, XML||Microsoft, Snowflake, Google, MongoDB, Aerospike, memSQL, neo4j, influxdb, Confluent|
|11||Principles and rules governing various types of data and maintaining data catalog, business glossary, data access controls, data privacy, data lineage||Data Governance||Collibra, Alation, Informatica, IBM, Talend Data Catalog, Denodo Data Catalog, Azure Data Catalog, AWS Glue Data Catalog||Collibra, Alation, Informatica, IBM, Talend, Denodo, Microsoft||Data Marketplace (DMP)|
|12||Maintaining a "single version of truth" for customer, product, material, etc.||Master Data Management, Reference Data Management||Informatica MDM, Informatica Customer 360, Informatica Product 360, Tibco EBX, Stibo Systems||Informatica, Tibco, Stibo Systems|
|13||Managing the life cycle of data from curation to retirement and certifying the trusted data; managing data quality||Data Stewardship||Collibra DGC, Informatica MDM, IBM Stewardship, Talend Data Stewardship||Collibra, Informatica, IBM, Talend|
|14||Data retention and data archival||Data Archival/Deletion||Informatica Data Archive (ILM), IBM InfoSphere Optim||Informatica, IBM|
|15||Protecting digital data through authentication and authorization||Data Security||Authentication: Windows LDAP, Active Directory, Azure Active Directory, Kerberos, SAML, OAuth 2.0; Authorization: internal product features of database, ETL, Data Preparation, and Data Virtualization tools (enabling SSL, data access controls), Imperva||Imperva|
|16||Data encryption; data anonymization||Data Masking||IBM InfoSphere Optim, Imperva Data Masking, Mentis (iMask), Informatica Data Privacy||IBM, Imperva, Mentis, Informatica|
|17||Interactive and modern visual representation of data in charts/dashboards||Data Visualization||Microsoft PowerBI, Tableau, Qlik, ThoughtSpot, Tibco Spotfire, Information Builders WebFocus, Yellowfin||Microsoft, Tableau, Qlik, Thoughtspot, Yellowfin, Information Builders|
|18||Data representation (Descriptive)||Data Reporting||Tableau, Microsoft, SAP, MicroStrategy, Tibco Spotfire, Qlik, r3||Tableau, Microsoft, Tibco, r3||Visual Library|
|19||Data analytics (Diagnostic, Predictive, Prescriptive)||Data Science and Machine Learning Platforms||SAS, Alteryx, Dataiku, Azure Data Lake Analytics (ADLA), HDInsight, Azure ML, Python, RStudio, Spark||Alteryx, Dataiku, DataRobot, Databricks, H2O.ai, ARRIA||PerisKop, Embedded Analytics|
|20||Understanding, managing, and maintaining information about data (data dictionary)||Metadata Management||Informatica, IBM, Talend, Denodo||Informatica, IBM, Talend, Denodo||MetaWisdom|
|21||Test Data (Pre- and Post-Loads); Compare source vs target data sets||Data Testing, Data Reconciliation||Custom ETL Testing||GateKeeper|
|22||Monitoring/auditing data||Data Auditing/Monitoring||Built-in features of DI/DV tools||iSee|
|23||Test data generation||Test Data Generation||Informatica, IBM, Talend||Informatica, IBM, Talend|
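
To make two of the tasks in the table more concrete, data profiling (row 2) and comparing source versus target data sets (row 21), here is a minimal Python sketch that uses only the standard library. It is an illustration under stated assumptions, not how any of the listed tools or HCL accelerators work: the function names `profile` and `reconcile`, the `id` key column, and the sample records are all hypothetical.

```python
def profile(rows, columns):
    """Count missing values and distinct non-missing values per column."""
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and the empty string as "missing" (an assumption).
        present = [v for v in values if v is not None and v != ""]
        report[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


def reconcile(source, target, key):
    """Compare source and target row sets on a key column."""
    src = {row[key]: row for row in source}
    tgt = {row[key]: row for row in target}
    shared = src.keys() & tgt.keys()
    return {
        # Keys loaded from the source that never reached the target.
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        # Keys in the target that have no counterpart in the source.
        "unexpected_in_target": sorted(tgt.keys() - src.keys()),
        # Keys present on both sides whose rows differ.
        "mismatched": sorted(k for k in shared if src[k] != tgt[k]),
    }


# Hypothetical sample data: a source extract and the loaded target.
source_rows = [
    {"id": 1, "name": "Ann", "country": "US"},
    {"id": 2, "name": "Bob", "country": ""},
    {"id": 3, "name": "Cy", "country": "IN"},
]
target_rows = [
    {"id": 1, "name": "Ann", "country": "US"},
    {"id": 3, "name": "Cy", "country": "UK"},
    {"id": 4, "name": "Di", "country": "DE"},
]

quality = profile(source_rows, ["name", "country"])
diff = reconcile(source_rows, target_rows, "id")
print(quality)
print(diff)
```

With this sample data, the profile reports one missing `country` value in the source, and the reconciliation flags key 2 as missing in the target, key 4 as unexpected, and key 3 as mismatched. A real pipeline would typically run such checks against database queries or files rather than in-memory lists.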