Data used in machine learning models and common pitfalls to watch out for
Machine learning needs no introduction today, given its prevalent use in driving new initiatives across industry domains. Thanks to its powerful pattern identification, machine learning is used in scenarios ranging from fraud detection and aiding medical diagnosis to molecular physics. For all its brilliant uses, there have also been cases where machine learning models have not lived up to expectations. False positives, incorrect training data arising from inept data collection, and bias in training data, among other factors, have become all too familiar, giving rise to numerous machine learning challenges.
There are many algorithms to choose from for a given situation, each with its own parameters (hyperparameters). If these are not adequately tuned, or the model is not backed by proper data collection practices, the results can be erroneous or inconsistent, leading to wasted effort.
Machine learning requires large data sets that can be used to adequately train and test a given model. These datasets can range from a handful of variables to a large collection of them, each with the objective of providing enough information for the model to accurately classify or predict against a given objective.
Taking an example from the digital payments domain: a global payment processor with around two billion cards in use and over 150 million transactions per hour, each examined in real time for fraud, must process an exceptional amount of data. In addition, rules around spending habits, location, and other patterns (such as which items are being bought and where they are being bought) lead to large datasets that must be curated and constantly updated to provide accurate predictions.
According to common literature, there are typically five to six distinct steps in the machine learning workflow. These are:
- Data Collection/Data Gathering – The amount and type of data that is collected/gathered will determine how good the machine learning model will be.
- Data Preparation – Data comes with missing information, duplicate data values, as well as other errors and inaccuracies which will need to be corrected. This cleaning up of information is part of the data preparation process. This is also a good time to perform some exploratory data analysis to determine if there are identifiable patterns that need to be considered, which may affect the outcome. Good practices which are included as part of data preparation, include randomizing data and splitting the data collected into training and testing data sets.
- Some good practices during exploratory data analysis include calculating the min, max, and average values of a particular feature or variable, and checking whether there are similar counts of positive and negative labels for continuous/categorical values. Also verify that each feature or variable has the right cardinality (number of values); for example, there cannot be more than one height per person in a dataset.
- Based on our collective experience, data visualization of individual features or variables aids in identifying anomalies with the individual values being considered.
- Choosing a Relevant Model – Different algorithms apply to different uses, and determining the correct model is an important factor. The scikit-learn documentation includes a cheat sheet that can assist beginners in identifying a suitable model (https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).
- Training the Selected Model – From the data preparation step, we use the identified training data to train the selected model, teaching it what to expect as input and what the corresponding output should be.
- Evaluating the Trained Model – Once the model has been trained successfully, it is time to evaluate it. Evaluation is done using a metric or set of metrics such as accuracy or recall. A common practice is an 80/20 (train/test) split of the prepared data: 80% is used to train the model and the remaining 20% is held out as a test set. Since the correct classification or prediction is already known for the test data, it is easy to verify whether the model classifies or predicts correctly, and the accuracy or recall metrics can then show whether the model is working as expected.
- Tuning the Model – Each algorithm has a set of default values that are applied. Each of these defaults can be tuned or altered to improve the algorithm's performance; the number of training steps, initialization values, and distribution values are examples of parameters that can be tuned. Once the model has been adequately tuned, it is ready to make the required classifications or predictions.
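The workflow above can be sketched end to end with scikit-learn. This is a minimal, illustrative example, not a production pipeline: the bundled iris dataset, the decision tree classifier, and the `max_depth=3` tuning choice are all assumptions made for demonstration.

```python
# Sketch of the workflow: prepare, split 80/20, train, evaluate, tune.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Data collection/preparation: load, randomize, and split 80/20.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)

# Choose and train a model on the 80% training split.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out 20% using accuracy as the metric.
accuracy = accuracy_score(y_test, model.predict(X_test))

# Tune: alter one hyperparameter from its default and re-evaluate.
tuned = DecisionTreeClassifier(max_depth=3, random_state=42)
tuned.fit(X_train, y_train)
tuned_accuracy = accuracy_score(y_test, tuned.predict(X_test))
```

In practice the tuning step would search over many hyperparameter values (for example with `GridSearchCV`) rather than trying a single alternative.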
As we can see from the steps above, data identification, data gathering, and data preparation are key steps and garner much of the attention required for a successful machine learning model. Given that data is the key, there are some challenges that need to be taken into consideration. These are:
- What data is being used and how closely does it resemble the relevant real world scenario?
- How can changes in the data, used to train machine learning models, affect real world scenarios? In other words, can they be used to aid fraud?
- Are there clear standards to be followed, with clear markers, as in any scientific endeavor, on what works and what controls, checks, and error measurements need to be in place for a machine learning model to be considered as “accurately providing the required information”?
Datasets used for machine learning
As stated above, the data used to train a machine learning model can range from a small set of variables to a large collection of them. While it might seem that the data used for training and testing will not change much in a real-world scenario, this is often not the case.
In real-world deployments, it has been observed that the data set used to train a machine learning model needs constant evaluation, and the model needs constant retraining on new data patterns as they are identified. This could be in response to false positives or to new real-world behaviors that were not previously taken into account.
This scenario needs to be taken into account whenever machine learning models are deployed into production. There will be a constant need to identify how often the model should be retrained on new data and whether that new data reflects real-world scenarios. This calls for a robust data management and data quality plan that provides the necessary information for current and, possibly, future data requirements. In our collective experience, improved data management and quality automatically lead to increased trust in data reliability and data utilization.
In addition to the above, it is important to formulate alerting mechanisms that are triggered when a data validation fails. Data validation is a part of the data management plan.
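A data validation check with an alerting hook can be as simple as the sketch below. The specific rules (non-negative amount, required card ID) and the record fields are illustrative assumptions borrowed from the payments example earlier; real plans would carry many more rules and route alerts to monitoring systems.

```python
# Minimal sketch: run validity checks and fire an alert callback on failure.
def validate_record(record, alert):
    """Run basic validity checks on one record; call `alert` per failure.

    Returns True when the record passes all checks.
    """
    failures = []
    if record.get("amount") is None or record["amount"] < 0:
        failures.append("amount must be a non-negative number")
    if not record.get("card_id"):
        failures.append("card_id is required")
    for message in failures:
        alert(message)  # e.g. page on-call, write to a monitoring log
    return not failures

# Usage: collect alerts for a deliberately invalid record.
alerts = []
ok = validate_record({"amount": -5, "card_id": ""}, alerts.append)
```

The key design point is that validation and alerting are separated: the same checks can feed a dashboard, a log, or a pager simply by swapping the callback.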
Can datasets used during machine learning aid fraud?
While strong data management and data quality plans can effectively improve the quality and effectiveness of the data used within models, it is also important to verify patterns within that data. Small changes in the data patterns provided to a model can lead it to incorrectly pass fraudulent transactions as genuine. While such a scenario may not be commonplace, it is imperative to take note of the data being fed in and ensure it is not being compromised at any level. Detecting such patterns is known to be difficult; however, strong data governance practices should guard against such occurrences.
In addition to data governance practices, understanding the data being used, either through feature-based analysis or data lifecycle analysis, is important.
Feature-based analysis includes:
- Identifying possible data “slices” (combinations of data) that can lead to the model performing poorly or merely adequately.
- Looking for possible “skew” in some of the data “slices” or combinations of data. For a normal distribution, folding the graph in half shows no variation between the left and the right side. Skewed data (negative or positive) is data where, when the graph is folded in half, the variation is greater to the left or the right, depending on the skew.
- Employing techniques like data cube analysis to analyze data “slices” or combinations of data, where aggregate statistics and evaluation metrics can be computed and verified for each slice.
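The skew check above can be sketched with `scipy.stats.skew`, which returns a value near zero for symmetric data and a large positive value for a long right tail. The slice names and transaction values below are illustrative assumptions.

```python
# Sketch: compute skewness per data "slice" to spot asymmetric distributions.
from scipy.stats import skew

slices = {
    "domestic": [10, 12, 11, 13, 12, 11, 10, 12],  # roughly symmetric
    "overseas": [10, 11, 10, 12, 11, 10, 95, 10],  # one large outlier
}

# Positive skew indicates a long right tail; values near zero suggest
# the "fold the graph in half" symmetry described above.
skew_by_slice = {name: skew(values) for name, values in slices.items()}
```

A slice whose skew differs sharply from the others (here, "overseas") is a candidate for closer inspection before it is fed to the model.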
Data lifecycle analysis includes:
- Possible identification of dependencies of features.
- Possible identification of sources of data related errors.
- Constant comparison of new data with previously available data, using the chi-square test (to verify whether there is a relationship between two variables, where a null hypothesis indicates no relationship between the two variables) or the t-test (to verify whether two groups of data are statistically different from each other, where a null hypothesis indicates no statistical difference between the means of the two groups).
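Both comparisons above are available in `scipy.stats`. The sketch below is illustrative: the label counts and transaction amounts are invented for demonstration, and a real comparison would use the actual old and new batches.

```python
# Sketch: compare a new data batch against a previous one.
from scipy.stats import chi2_contingency, ttest_ind

# Chi-square test on label counts: rows are batches, columns are
# positive/negative label counts. Null hypothesis: no relationship
# between batch and label distribution.
contingency = [[480, 520],   # old batch: positives, negatives
               [450, 550]]   # new batch: positives, negatives
chi2, chi2_p, dof, expected = chi2_contingency(contingency)

# t-test on a continuous feature: null hypothesis is no statistical
# difference between the means of the two groups.
old_amounts = [20.1, 22.4, 19.8, 21.0, 20.5, 22.0, 19.9, 21.3]
new_amounts = [20.3, 21.8, 20.0, 21.1, 20.7, 21.9, 20.2, 21.0]
t_stat, t_p = ttest_ind(old_amounts, new_amounts)
```

A small p-value from either test signals that the new data has drifted from the old, which is exactly the cue to re-evaluate and possibly retrain the model.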
Can datasets used introduce bias into the models inherently?
Bias, when included in a data model, can lead to classifications or predictions that are discriminatory in nature. To avoid such scenarios, it is essential to check for under-representation or over-representation within the variables in the data set. It is also critical to verify and remove patterns that may no longer be applicable in a given scenario and re-train the machine learning model. At this point in time, AI or machine learning models cannot identify bias by themselves; it is the responsibility of the individuals or teams who assemble the dataset to identify and remediate it. As an example of bias, a machine learning model may identify one group of flowers, say roses, more often than another group, say daffodils. This could be because the majority of the data set refers to roses rather than daffodils. If the dataset is corrected to fix this anomaly, the model should have no difficulty correctly identifying the respective flowers based on the data provided.
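The representation check described above can be sketched as a simple class-balance audit. The 90/10 rose-to-daffodil split and the 0.2 deviation threshold are illustrative assumptions; an appropriate threshold depends on the problem.

```python
# Sketch: flag classes that are under- or over-represented in a dataset.
from collections import Counter

labels = ["rose"] * 90 + ["daffodil"] * 10  # a heavily imbalanced dataset

counts = Counter(labels)
total = sum(counts.values())
proportions = {label: count / total for label, count in counts.items()}

# Flag any class whose share deviates strongly from a uniform split
# (assumed threshold of 0.2 for illustration).
expected = 1 / len(counts)
imbalanced = {label: share for label, share in proportions.items()
              if abs(share - expected) > 0.2}
```

Flagged classes can then be rebalanced, for example by collecting more daffodil examples or by resampling, before the model is retrained.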
Are there clear standards to be followed while building machine learning models?
Unfortunately, there are no clear standards applicable at present. However, it is up to the teams who devise such models to provide the required documentation on how to calibrate the models, how to detect errors, and what the limitations of the models are. The appropriate checks and balances around controls, validity checks, and error measurements must be clearly established and documented, so that users of these models can use them correctly and question them appropriately.
To summarize, we have seen an explosion of advancement in computation power, data availability, and the variety of algorithms being researched and made available. It is our collective responsibility to utilize these advancements effectively, and without adverse effects, in our respective organizations.