Analytics is changing the world, and the impact of AI is clearly visible across industries. Where earlier analysis was mostly descriptive in nature (dashboards and reporting), modern AI is mostly prescriptive. Analytics is making inroads into very niche domains, ranging from medical research and fraud detection to predictive maintenance. Most of the data scientists solving these problems are not domain experts and hence don’t have a deep understanding of the data. In this blog, we will open a discussion to explore the right approach for solving such problems using data-centric AI.
As the 80/20 rule states, 80% of the entire analytics lifecycle is spent simply on finding, cleansing, and organizing data; the remaining 20% goes to algorithms, models, and coding. But if you look at the skill set of such analytics projects (or organizations), it is mostly centered on ML models, coding, algorithms, and hyperparameter tuning. This is the gap that we would like to discuss in this blog.
A quick glance at recent activity in the analytics space shows that most of the research, debates, and conversations are directed at improving ML models, their algorithms, and hyperparameters by a small delta, while data-centric tools and technologies lag far behind. So the gap is evident: more than 90% of time and effort is spent on improving ML models (algorithms and hyperparameters), and merely 5-10% on data pre-processing. Our experience in this domain shows a similar trend. It is not the lack of good algorithms or ML model fine-tuning capabilities that restricts us from building better ML models, but the size, interpretation, and pre-processing of the training data. If, as data scientists, we get a good volume of clean and meaningful data, any reasonable algorithm implementation from the internet will be able to produce a model with good accuracy.
Some of the common data issues that need to be investigated to get better ML models are as follows:
- Scattered data
One of the most common issues is that data is scattered across varied systems, formats, and schemas. Accessing this data and preparing it for ML is a tedious process that demands specialized skills.
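As a minimal sketch of what this consolidation step looks like, the snippet below merges customer records from two hypothetical sources (a CSV export and a JSON feed) with different schemas into one canonical shape. All field names here are illustrative assumptions, not from any real system.

```python
import csv
import io
import json

# Hypothetical sources with mismatched schemas (illustrative data only).
crm_csv = "cust_id,full_name,spend\n101,Ann Lee,250\n102,Raj Patel,90\n"
web_json = '[{"customerId": 102, "name": "Raj Patel", "pageViews": 14}]'

def normalize_csv(text):
    # Map the CSV schema onto one canonical record shape.
    for row in csv.DictReader(io.StringIO(text)):
        yield {"customer_id": int(row["cust_id"]),
               "name": row["full_name"],
               "spend": float(row["spend"])}

def normalize_json(text):
    # Map the JSON schema onto the same canonical shape.
    for row in json.loads(text):
        yield {"customer_id": row["customerId"],
               "name": row["name"],
               "page_views": row["pageViews"]}

# Merge by customer_id so downstream feature engineering sees one table.
merged = {}
for rec in list(normalize_csv(crm_csv)) + list(normalize_json(web_json)):
    merged.setdefault(rec["customer_id"], {}).update(rec)

print(merged[102])  # combines CRM spend with web page_views
```

In practice this mapping layer is where schema conflicts and unit mismatches surface, which is exactly why the step demands domain knowledge and specialized skills.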
- Labeling subjectivity
For a supervised learning problem, the data needs to be labeled correctly and consistently. Labeling might be a simple job or a very complex one depending on the domain you are targeting. Access to SMEs (Subject Matter Experts) might be a challenge, or they might become a bottleneck in the entire pipeline.
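One common way to make labeling subjectivity measurable is to have two annotators label the same samples and compute Cohen's kappa, which corrects raw agreement for chance. The sketch below uses toy labels from two hypothetical SMEs; the labels and the spam/ham task are assumptions for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same class independently.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy labels from two hypothetical SMEs on the same 8 samples.
sme_1 = ["spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham"]
sme_2 = ["spam", "ham", "spam", "spam", "ham", "ham", "ham", "ham"]
print(round(cohens_kappa(sme_1, sme_2), 2))  # → 0.47, only moderate agreement
```

A low kappa is a signal that the label definition itself is ambiguous, and that tightening the definition will pay off more than tuning the model.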
- Data size
Depending on the problem at hand, samples may be plentiful or scarce. For example, you can easily get a lot of data on phone or internet users in an area to do some analysis, but if you are looking at a medical diagnostic use case for a rare disease, the samples may be few. It is always better to have a large data set for better ML model training. Another option is to create synthetic data for training purposes.
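The simplest form of synthetic data generation is to jitter existing rare-class samples with small noise, a crude stand-in for proper techniques such as SMOTE or generative models. The feature vectors below are invented for illustration and are not real medical data.

```python
import random

random.seed(0)

# Hypothetical rare-disease feature vectors (illustrative values only).
rare_cases = [[5.1, 130.0], [4.8, 142.0], [5.4, 128.0]]

def augment(samples, n_new, noise=0.02):
    """Create synthetic samples by jittering real ones with small
    multiplicative noise. A deliberately crude sketch; real pipelines
    would use SMOTE, domain simulators, or generative models."""
    out = []
    for _ in range(n_new):
        base = random.choice(samples)
        out.append([x * random.uniform(1 - noise, 1 + noise) for x in base])
    return out

synthetic = augment(rare_cases, n_new=10)
print(len(rare_cases) + len(synthetic))  # 13 training samples instead of 3
```

Synthetic samples must stay plausible under domain constraints, which is another place where SME review matters.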
- Data balance
For an ML model to train well, it is recommended to have a balanced data set: each category (or class) should have sufficient data points. For example, if you are looking to create a mail classification model, there might be loads of samples of useful mail but very few spam mails to train the ML model on.
- Data drift
Data tends to drift with time, location, and many other factors. For example, sales in a store follow a trend over the year, and all the various ups and downs should be captured in the data used for training. If drift happens after the ML model is trained, the model should be retrained on new data.
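A minimal drift monitor can compare the production feature mean against the training distribution. The check below flags drift when the production mean moves more than a few training standard deviations away; it is a deliberately simple sketch (real pipelines use tests such as Kolmogorov-Smirnov or the Population Stability Index), and the sales figures are invented.

```python
import statistics

def drifted(train, prod, threshold=2.0):
    """Flag drift when the production mean is more than `threshold`
    training standard deviations from the training mean."""
    mu, sigma = statistics.mean(train), statistics.stdev(train)
    return abs(statistics.mean(prod) - mu) / sigma > threshold

# Hypothetical daily sales: training covers a normal period, production
# a promotional spike the model never saw.
train_sales = [100, 104, 98, 101, 99, 103, 97, 102]
prod_sales = [140, 150, 138, 145]

if drifted(train_sales, prod_sales):
    print("drift detected: schedule retraining on fresh data")
```

Hooking a check like this into the scoring pipeline is what turns "retrain when drift happens" from a hope into an automated trigger.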
So basically, the intent and recommendation is to move from “HUGE data” to “CONSISTENT data”:
- Consistency in definition – the target (Y) should be defined clearly to assist in better labeling
- All categories (or classes) should be sufficiently represented – a balanced X data set
- Timely feedback on production data drift
- An optimum data size for the model to be trained effectively
Based on the above analysis, we recommend that AI developers shift their attention from models and algorithms to the quality of data. Once an ML model is trained, data scientists benchmark its efficiency and accuracy. If they find that the accuracy is not up to the mark and can be improved further, they generally start changing the training algorithm or fine-tuning the parameters of the existing one. This is a tedious process that yields only a very small improvement, if any. The alternative, recommended approach is to improve the data quality while keeping the algorithm static. In most cases, this “data-centric” approach has yielded better results.
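The data-centric loop can be demonstrated end to end with a toy experiment: hold a deliberately simple model fixed (here a 1-nearest-neighbour classifier on 1-D features, chosen only for brevity) and improve accuracy purely by fixing a bad label. All data points below are invented for illustration.

```python
def knn1(train_data):
    """Fixed, deliberately simple model: 1-nearest-neighbour on a 1-D
    feature. The model code never changes; only the data does."""
    return lambda x: min(train_data, key=lambda p: abs(x - p[0]))[1]

def accuracy(clf, test_data):
    return sum(clf(x) == y for x, y in test_data) / len(test_data)

# Toy 1-D data: class "a" clusters near 0, class "b" near 10.
clean = [(0.1, "a"), (0.4, "a"), (9.8, "b"), (10.1, "b")]
test  = [(0.3, "a"), (9.7, "b"), (0.0, "a"), (10.2, "b")]

# The same data with one label flipped -- the kind of error a label audit finds.
noisy = [(0.1, "a"), (0.4, "a"), (9.8, "a"), (10.1, "b")]

print(accuracy(knn1(noisy), test))   # → 0.75: one bad label costs a test point
print(accuracy(knn1(clean), test))   # → 1.0 after fixing the data, same model
```

No algorithm change, no hyperparameter search: correcting a single label closed the accuracy gap, which is the data-centric argument in miniature.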
Benefits of a data-centric approach
- Cheaper – less effort by data scientists and less compute used
- Less time-consuming
- Easier to understand and fix
When can a data-centric approach not be used?
- When data is coming from an external entity and we do not have a sufficient understanding of the data
- When it is expensive – the SMEs needed for data labeling are rare and costly to hire
- When synthetic data is needed – generating it sometimes requires huge computing power and is hence expensive
So, it is clearly evident that the quality of the training data (or input data) under a data-centric approach has a direct and huge impact on ML model training. But before adopting a data-centric ML workflow, we must ensure that proper quality checks are in place so the data is of the highest quality and standard. In conclusion, if we use the analogy of an AI project as a car, then clean, high-quality data is the fuel it needs to run and succeed.