AI adoption is becoming commonplace across industries. Companies are at different stages of AI maturity and, depending on the level of adoption, are seeing varying levels of impact. Amid the excitement of rapid adoption, many companies may be unaware of the inherent risks of AI models. Stakeholders increasingly consume insights from AI applications, from less critical ones to mission-critical ones, for decision-making. Unless this is managed well, the implications can be far-reaching. History offers instances of things going wrong, and the 2008 financial crisis is a case in point.
The world woke up in shock and surprise on September 15, 2008, when Lehman Brothers Holdings filed for bankruptcy. What followed was a global financial crisis and trillions of dollars in economic losses worldwide. It was a systemic failure caused by poor risk management, over-leveraged markets, unrealistic analytical models, and a lack of regulatory and executive oversight. A telling example is the infamous Gaussian copula approach proposed by David Li, which priced baskets of financial securities based on the credit default correlation among them.
Financial institutions rapidly adopted such simplistic correlation models to price CDOs (Collateralized Debt Obligations) and CDSs (Credit Default Swaps) whose underlying assets were subprime mortgages, corporate loans, and the like. Such narrow models, built on data from favorable market conditions, did not represent the behavior of interconnected, complex financial markets. Scenario modeling, risk assessment, and stress testing of models against adverse market conditions were not part of the process then, and the consequences are well known.
A model is only as good as the data on which it is trained. The lessons of the 2008 crisis led to stronger model governance, model testing, and model risk assessment. Laws were passed to ensure transparency of the underlying decision-making models, third-party model testing, frequent regulatory reporting, audits, and more. It took a crisis to unearth systemic risks, the limitations of analytical models, and the need for new governance processes.
A decade later, we are using many more AI models to process new types of data such as images, videos, text, and streaming device data. We have also seen many models fail: the tragic accident involving Uber’s self-driving car, Facebook’s chatbots developing their own language, Microsoft’s AI chatbot corrupted by Twitter trolls, IBM Watson providing unsafe oncology treatment recommendations, and Apple’s Face ID system fooled by a 3D mask, among others.
While considerable advances are being made in model algorithms, we are also witnessing a growing set of threats: training data poisoning, adversarial attacks through physical and digital perturbations, exploitation of system vulnerabilities, and more. Scenario planning, threat assessment, and model testing are essential to assess the capabilities, limitations, and vulnerabilities of models.
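To illustrate how low the barrier to such attacks is, below is a minimal sketch of a digital adversarial perturbation in the fast gradient sign method (FGSM) style. It assumes a trained, differentiable PyTorch classifier and a labeled input batch; all names and the perturbation budget are illustrative, not part of any specific framework.

```python
# Minimal FGSM-style adversarial perturbation sketch.
# Assumes a trained, differentiable PyTorch classifier `model`
# and a labeled batch (x, y); epsilon is an illustrative budget.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon=0.01):
    """Return a copy of x perturbed in the direction that increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step by epsilon along the sign of the input gradient.
    return (x_adv + epsilon * x_adv.grad.sign()).detach()
```

A model that changes its prediction for such a slightly perturbed input highlights exactly the kind of vulnerability that threat assessment and model testing are meant to surface.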
AI models have become integral to decision-making in many organizations. With rapid adoption, models may reach production without appropriate testing and validation, and that can lead to dangerous consequences. Model testing and risk assessment are therefore key to a successful AI strategy, and models must be tested against a range of market scenarios and conditions. With AI adoption peaking across sectors, we should apply the lessons of the 2008 financial crisis and more recent model failures; there is no need to reinvent the wheel.
The uniqueness of testing AI systems
AI systems are fundamentally different from traditional software systems. In traditional software, developers code the logic, which then processes data to produce the desired outcome or expected behavior. AI systems, by contrast, learn their business logic from training data, and that learned logic is then used to process new data and generate output. A further challenge is that AI output is probabilistic, whereas traditional software output is deterministic. The probabilistic nature of AI models makes the creation of test oracles even more difficult and requires due attention. The complexity multiplies further for unsupervised, deep learning, and reinforcement learning models.
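One practical consequence is that test oracles for AI systems are usually statistical rather than exact. Below is a minimal sketch of such an oracle, assuming a scikit-learn-style classifier and a held-out test set; the function name and thresholds are illustrative assumptions.

```python
# Minimal sketch of a statistical test oracle for a probabilistic classifier.
# Assumes a scikit-learn-style `model` and a held-out set (X_test, y_test);
# thresholds are illustrative and should reflect business criticality.
import numpy as np

def evaluate_oracle(model, X_test, y_test, accuracy_threshold=0.90, confidence_floor=0.60):
    proba = model.predict_proba(X_test)              # probabilistic outputs, not fixed values
    predictions = np.argmax(proba, axis=1)
    accuracy = float(np.mean(predictions == y_test))
    low_confidence = float(np.mean(np.max(proba, axis=1) < confidence_floor))
    # Pass only if accuracy clears the threshold and low-confidence predictions stay rare.
    passed = accuracy >= accuracy_threshold and low_confidence <= 0.05
    return {"accuracy": accuracy, "low_confidence_share": low_confidence, "passed": passed}
```

Instead of asserting a single expected output, the oracle asserts that the distribution of outcomes meets agreed quality bounds.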
Testing AI models
As per the US Federal Reserve’s supervision letter SR 11-7, “Model validation is the set of processes and activities intended to verify that models are performing as expected, in line with their design objectives and business uses. Effective validation helps to ensure that models are sound, identifying potential limitations and assumptions, and assessing their possible impact. All model components—inputs, processing, outputs, and reports—should be subject to validation; this applies equally to models developed in-house and to those purchased from or developed by vendors or consultants.”
With more data and more models being created, continuous monitoring and evaluation are needed to ensure the models deliver the intended results with little or no risk. A test strategy must be created that defines the features to be tested, the types of tests to be conducted, the test environment, test management tools, entry and exit criteria, and so on. Model testing is key to ensuring that model risks are identified, communicated to stakeholders, and mitigated. It focuses on testing at every stage of model development, as explained below:
Data exploration phase: This phase involves understanding the data: its distribution, quality, sampling approach, summary statistics, thresholds, feature engineering, transformations, imputations, anomalies, labels, assumptions, biases, and so on. While descriptive analysis and data visualization help build a basic understanding of the data, detecting anomalies and defining thresholds require statistical techniques such as the IQR (interquartile range) and ML techniques such as k-Medoids clustering, Naïve Bayes classification, DBSCAN, and One-Class SVM. Data understanding and exploration are key to defining test scenarios and test oracles. They also help define the edge cases where the model might give ambiguous results. Documenting such test scenarios is essential for effective risk management.
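For illustration, a minimal sketch of two of the anomaly checks mentioned above, an IQR fence for single columns and a One-Class SVM for multivariate outliers, assuming a numeric pandas DataFrame; the parameter values are illustrative.

```python
# Minimal sketch of anomaly detection during data exploration.
# Assumes a numeric pandas DataFrame `df`; parameters are illustrative.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside the interquartile-range fence [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

def svm_outliers(df: pd.DataFrame, nu: float = 0.05) -> pd.Series:
    """Flag multivariate anomalies with a One-Class SVM (nu ~ expected outlier share)."""
    X = StandardScaler().fit_transform(df)
    labels = OneClassSVM(nu=nu, kernel="rbf", gamma="scale").fit_predict(X)
    return pd.Series(labels == -1, index=df.index)
```

Records flagged by such checks are good candidates for the edge-case test scenarios documented in this phase.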
Model training and validation phase: This phase involves feature engineering and experimentation with different models to derive optimal results. Model testing covers verification of variable correlation and multicollinearity, the assumptions behind variable and model selection, and the evaluation of model performance metrics such as training vs. validation accuracy, ROC, AUC, specificity, sensitivity, F-score, R², RMSE, and MAPE. Acceptance criteria should be defined for these metrics based on the business criticality of the use case, including the probability threshold beyond which a model output is accepted or rejected. While these macro metrics are monitored, it is equally important to drill down and assess how they vary across specific prediction classes and data samples.
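A minimal sketch of checking validation metrics against acceptance criteria, assuming binary class labels, predicted labels, and positive-class scores from a held-out set; the threshold values are illustrative and depend on the use case.

```python
# Minimal sketch of validating model metrics against acceptance criteria.
# Assumes binary labels `y_true`, predictions `y_pred`, and scores `y_score`;
# thresholds are illustrative and use-case dependent.
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

ACCEPTANCE_CRITERIA = {"accuracy": 0.90, "f1": 0.85, "roc_auc": 0.92}

def validate_model(y_true, y_pred, y_score):
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_score),
    }
    # A metric fails if it falls below the agreed acceptance threshold.
    failures = {name: value for name, value in metrics.items()
                if value < ACCEPTANCE_CRITERIA[name]}
    return metrics, failures
```

The same check can be repeated per prediction class or data segment to catch cases where a strong aggregate score hides weak performance on a critical subgroup.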
Model deployment phase: In this phase, the model is optimized for inference, hardened, and packaged for deployment in target environments. Models are deployed as APIs, web services, batch jobs, and so on, and the end-to-end data and ML pipeline is established along with the appropriate operations and monitoring tools. Integration testing is required to validate that every module and the pipeline as a whole are robust and that the output is suitable for consumption by downstream systems.
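As an illustration of such an integration test, below is a minimal pytest-style sketch for a model served behind a hypothetical REST endpoint; the URL, payload schema, response fields, and timeout are all assumptions for the example.

```python
# Minimal sketch of an integration test for a model served as a REST API.
# The endpoint, payload, and response schema below are hypothetical.
import requests

def test_prediction_endpoint():
    payload = {"features": [5.1, 3.5, 1.4, 0.2]}
    response = requests.post("http://localhost:8080/predict", json=payload, timeout=2)
    assert response.status_code == 200
    body = response.json()
    # Downstream systems expect a class label and a probability in [0, 1].
    assert "label" in body and "probability" in body
    assert 0.0 <= body["probability"] <= 1.0
```

Similar tests can cover batch jobs and message queues so that every stage of the pipeline is exercised end to end before go-live.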
Model in production phase: This is the most challenging phase. The model will encounter data it may never have seen before, yet it is still expected to perform and deliver the desired results. A key consideration is how effectively the model generalizes to such new data and produces meaningful decisions. Constant monitoring is required to track the model’s performance and to watch for model drift caused by concept drift and data drift. Techniques such as back-testing are also needed to check for deviations, errors, and risks, which must be flagged when they occur.
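One common way to watch for data drift is to compare the live distribution of a feature against the training-time reference. Below is a minimal sketch using a two-sample Kolmogorov-Smirnov test; the significance level is an illustrative assumption.

```python
# Minimal sketch of data-drift monitoring with a two-sample Kolmogorov-Smirnov test.
# Assumes reference (training) and live (production) samples of a single feature;
# the significance level alpha is illustrative.
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True if the live distribution differs significantly from the reference."""
    statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha
```

A drift alert of this kind would typically trigger deeper back-testing and, if confirmed, retraining or recalibration of the model.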
Effective testing across these phases helps identify issues, defects, sub-optimal performance, and underlying risks. UnBox.ai, HCLTech’s AI model testing framework, provides a comprehensive approach to testing AI models at every stage of their development, implementation, and sustenance. Since AI models are predominantly black boxes, they need to be unboxed: made more explainable, with visibility into model weights, variable importance, biases, decision boundaries, and more, to enable effective test design.
UnBox.ai, combined with accelerators such as DataGenie for test data generation, Spotter DQV for data quality measurement, and TranSecAI for model transparency and security checks, helps accelerate model testing. UnBox.ai evaluates models across nine dimensions and provides a comprehensive assessment and risk rating. The nine dimensions, termed D-CoDFIERRS, encompass testing for Data Privacy, Correctness, Drift, Fairness, Interpretability, Efficiency, Relevance, Reproducibility, and Security. The framework helps identify defects at every stage of the model life cycle and provides a view of a model’s suitability for use in the field, determining whether it is ‘fit for use’, ‘needs improvement’, or ‘unfit for use’, with appropriate findings and recommendations.

Figure 1: The D-CoDFIERRS of UnBox.ai
As AI evolves and becomes integral to every aspect of business, every model must be tested extensively so that stakeholders understand its purpose, limitations, and risks and can make informed decisions. Emerging policies and regulations will hold the board and senior management of organizations accountable for the outcomes of AI applications. While making the most of AI implementations is crucial, it is equally important to put the right checks and balances in place to avoid negative consequences. To create a world of beneficial, responsible, and risk-free AI, adopt UnBox.ai. For more details, please contact NEXT.ai@hcl.com.