It is estimated that as much as 90% of the data we generate daily is unstructured. But what is unstructured data and why is it important? Unstructured data is data with no predefined format or schema: emails, Word documents, Excel sheets, meeting transcripts. This data can be outdated, incorrect, and/or unverified, all of which constitute potential security risks. Most privacy regulations require companies to keep data only for as long as it is needed, after which it should be discarded.
Unstructured data is rarely useful to a company’s business goals. Moreover, without an effective means of managing this data, companies expose themselves to security and compliance risks. The latter is of particular concern as negligence in adhering to the growing regulatory requirements can lead to substantial fines.
Most privacy regulations, like the General Data Protection Regulation (GDPR) and the Protection of Personal Information Act (POPIA), stipulate that companies must limit the types of data collected, as well as the duration for which they can hold it. This need is magnified even more in the wake of catastrophic cybersecurity breaches in recent years. In the first half of 2021 alone, the top 10 breaches resulted in over 98.2 million individuals being impacted.
Today’s processes for managing unstructured data are manual and prone to error. Data analytics and machine learning tools aim to streamline these processes by harnessing this data, helping enterprises ensure that it is stored and discarded in a compliant manner. In theory, such tools can transform the business into a future-ready enterprise; in reality, very few organizations have managed to successfully implement data governance practices and achieve their goals.
Structuring the future
The root cause of the data problems is the approach. Most businesses separate IT from the core business. Digital transformation initiatives, such as artificial intelligence (AI), machine learning, and more, are treated as technology projects, in which the software is implemented without taking the employees, business processes, and company structures into consideration. Consequently, the investments do not provide the desired ROI. To achieve the expected transformative change, a better strategy is to start with the end in mind; that is, before searching for the technology support, businesses must determine the end-result they wish to achieve.
Most projects aiming to clean up unstructured data follow three major steps:
- Scanning
- Filtering
- Remediating
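The three steps above can be sketched as a small pipeline. This is a minimal illustration only, not the tooling used in any real project: the `FileRecord` fields, the five-year `RETENTION` period, and the sample inventory are all assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


# Hypothetical record for one scanned file; the fields are illustrative only.
@dataclass
class FileRecord:
    path: str
    last_modified: date
    contains_personal_data: bool


# Assumed retention period; real limits depend on the regulation and data type.
RETENTION = timedelta(days=365 * 5)


def scan(inventory):
    """Scanning: enumerate every file in the (here, in-memory) inventory."""
    return list(inventory)


def filter_candidates(records, today):
    """Filtering: keep files that hold personal data past the retention period."""
    return [
        r for r in records
        if r.contains_personal_data and today - r.last_modified > RETENTION
    ]


def remediate(candidates):
    """Remediating: propose an action per file, to be reviewed before it is applied."""
    return {r.path: "delete" for r in candidates}


# Made-up sample data for the sketch.
inventory = [
    FileRecord("hr/contract_2012.docx", date(2012, 3, 1), True),
    FileRecord("hr/contract_2023.docx", date(2023, 6, 1), True),
    FileRecord("misc/lunch_menu.xlsx", date(2010, 1, 1), False),
]
today = date(2024, 1, 1)

actions = remediate(filter_candidates(scan(inventory), today))
print(actions)
```

Here only the 2012 contract is proposed for deletion. The more precise the filtering step, the fewer candidates reach the costly remediation step.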
The current breed of applications for unstructured data management often fails to provide the clarity needed to remediate unstructured data completely and accurately. Remediation often requires manual intervention, and because a scan can return thousands of data points for an individual employee to remediate, the work becomes time- and cost-intensive.
A recent project HCLTech undertook illustrates the issue. The pilot environments were UK HR and LATAM GSS Payroll (Brazil, Chile, and Peru). In the HR pilot, files were attached to an employee, which in theory provided some ‘structure,’ while GSS Payroll had a wider range of files to grapple with.
The initial process followed the standard steps. It quickly became untenable, since the scan in Brazil alone returned over 30 million files marked as remediation candidates. There were several reasons for this large number. First, the company’s culture was to ‘save everything in perpetuity.’ Second, the chosen software looked for patterns but did not recognize languages. For example, “gift” is a ‘present’ in English, but means ‘poison’ in German. Third, the software scanned at folder level, making it impossible to use the granular labels developed by HCLTech’s business analyst.
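The language problem can be illustrated with a small sketch. It is not how the software in the pilot worked; the term lists, function names, and sample sentences are hypothetical, and the language of each text is assumed to be known in advance (language detection is outside the scope of the example).

```python
import re

# Hypothetical per-language term lists; a real deployment would need curated terms.
FLAGGED_TERMS = {
    "en": {"poison"},
    "de": {"gift"},  # German "Gift" means "poison"
}


def words_in(text):
    """Split text into a set of lowercase words."""
    return set(re.findall(r"\w+", text.lower()))


def flag_language_blind(text):
    """Match against every term list, ignoring the language of the text."""
    all_terms = set().union(*FLAGGED_TERMS.values())
    return bool(words_in(text) & all_terms)


def flag_language_aware(text, lang):
    """Match only against the term list for the text's (assumed known) language."""
    return bool(words_in(text) & FLAGGED_TERMS.get(lang, set()))


english = "Thank you for the birthday gift"
german = "Das Gift wurde im Labor gefunden"

print(flag_language_blind(english))        # True: false positive on English "gift"
print(flag_language_aware(english, "en"))  # False
print(flag_language_aware(german, "de"))   # True
```

A language-blind match flags the harmless English sentence, which is one way candidate counts can balloon when a scanner treats patterns as language-independent.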
The HCLTech Perspective
This case study shows that unstructured data management requires process reengineering from the very beginning. Only by doing this will you be able to manage the data and strengthen your privacy posture.
Starting with the end goal in mind is critical to prioritizing technology that will eventually reduce time and effort. Rapidly running scans and filtering random data from unstructured sources is simply not enough to accurately fix all data-structuring issues. A holistic approach to data management that considers people, process, and technology will be the deciding factor in achieving meaningful gains and reaching the end goals.