The AI frontier: driving reliable and stable IT operations
Meshach Samuel Europe Solution Head, Digital & Analytics Practice | February 26, 2021

In 2020, I came across an article that talked about how Artificial Intelligence (AI) is expected to be the new catalyst for software development. The article stated that artificial intelligence-powered software development tool providers had raised more than USD 700 million in just 12 months. And this was before COVID-19 compelled enterprises to undertake rapid digital transformation.

Of course, this move forward has been accompanied by an accelerated growth in artificial intelligence adoption. According to HCL’s Digital Acceleration report, artificial intelligence has catapulted to become one of the biggest drivers of technology investment for business and IT leaders globally.

AI’s Role in IT Operations

With the increasing adoption of Site Reliability Engineering (SRE) concepts in mainstream enterprises, automation is becoming more pervasive in IT operations. In this blog post, we shall explore the prevalence of AI and Machine Learning (ML) in application IT support, rather than in infrastructure support. I believe application IT support is the more complex problem to solve.

Let’s look at the three types of tickets that typically get created in IT operations (service requests, incidents, and alerts) and consider how AI and machine learning are used to handle each type.

Service Requests

Service request handling is the most common use of AI/ML because Standard Operating Procedures (SOPs) can be created easily for such tickets. Once we have an SOP, Natural Language Processing (NLP)-based understanding and classification models, combined with Robotic Process Automation (RPA), can enable automated resolution of these tickets unless authentication is required. In such cases, opsbots (chatbots) can serve as an alternative to self-service portals. Chatbots also bring the added advantage of helping visually challenged people.
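As a rough illustration of the classification step, here is a minimal sketch that maps a ticket's text to an SOP. It assumes a simple keyword-overlap score in place of a trained NLP model; the `SOP_KEYWORDS` catalogue and the `classify_request` function are hypothetical names invented for this example.

```python
import re

# Hypothetical SOP catalogue: keywords that signal each request type.
# A real deployment would learn these associations from labelled tickets.
SOP_KEYWORDS = {
    "password_reset": {"password", "reset", "locked", "unlock"},
    "access_request": {"access", "permission", "grant", "role"},
    "software_install": {"install", "software", "license"},
}


def classify_request(text):
    """Return the SOP whose keywords overlap most with the ticket text.

    Returns None when no keyword matches, so the ticket can fall back
    to a human operator or a chatbot conversation.
    """
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sop: len(tokens & keywords) for sop, keywords in SOP_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(classify_request("Please reset my password, account is locked"))
```

A ticket that matches an SOP could then be handed to an RPA workflow, while a `None` result would be routed to a person.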

Incidents

Incident handling can be categorized into three use cases: recovery, resolution, and prevention. Let’s start with the first, where we look at how AI and ML are used to facilitate rapid recovery in the aftermath of an incident.

Recovery

Today, infrastructure as code, service mesh, containerization, and microservices architecture are becoming the norm. Automated recovery using AI/ML ensures high availability (HA) in these applications or platforms. This might include, but is not limited to, autoscaling of applications based on model rules and automated mission control operations such as segmentation, backpressure, and bulkhead creation. These remediation techniques are applied automatically by integrating simple pattern recognition models with relevant actions that execute without operator intervention.
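The pairing of a recognized pattern with an automatic action can be sketched as a small rule table. This is only an illustration under stated assumptions: the metric names, thresholds, and action labels below are hypothetical, and a production system would replace the hand-written conditions with learned pattern recognition models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """Pairs a pattern (a predicate over current metrics) with a recovery action."""
    name: str
    matches: Callable[[Dict[str, float]], bool]
    action: str


# Hypothetical thresholds and action names, chosen for illustration only.
RULES: List[Rule] = [
    Rule("high_cpu", lambda m: m.get("cpu_pct", 0) > 85, "scale_out"),
    Rule("error_spike", lambda m: m.get("error_rate", 0) > 0.05, "isolate_bulkhead"),
    Rule("queue_backlog", lambda m: m.get("queue_depth", 0) > 1000, "apply_backpressure"),
]


def plan_recovery(metrics: Dict[str, float]) -> List[str]:
    """Return the recovery actions whose patterns match the current metrics."""
    return [rule.action for rule in RULES if rule.matches(metrics)]


print(plan_recovery({"cpu_pct": 92, "error_rate": 0.01, "queue_depth": 1500}))
```

Keeping the pattern and the action in one record makes it easy to audit which condition triggered which automated step.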

Resolution

Incident resolution involves routing, triaging, and remediating the incident.

Routing: In any conventional incident resolution cycle, identifying the right person or resource and routing the ticket to them is a typical form of waste, in the sense used when lean management principles are applied to the IT operations value stream.

AI optimizes the ticket allocation process by referencing data from all previous ticket allocations, from the service desk to the various operations teams. It also takes into consideration existing information about ticket hops that have taken place previously. With the ability to automatically categorize a ticket using natural language processing and ticket type, allocation to appropriate teams is seamless and fast. In certain cases, these tickets are assigned to the exact engineer whose code base was problematic. This is possible through AI/ML combined with the ability to trace an error back to the responsible engineers, based on the backward traceability established by mature CI/CD practices.

I recall my experience of working with a global retail giant struggling with a very high number of rerouted tickets. They needed to reduce the number of rerouted tickets and cut back on the resolution time. We approached the rerouting issue by using previous rerouting data to train the AI/ML models. These learnings were then fed into a vectorization model to classify subsequent requests. This proved to be an effective solution. Through continuous learning, the AI model increased the first-time successful allocation rate from an initial 30-35% to 91% of total cases within three months.
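To make the idea of vectorizing tickets for routing concrete, here is a minimal sketch. It is not the model used in the engagement described above; it assumes a plain bag-of-words vector per ticket and cosine similarity against one aggregated vector per team. The sample `HISTORY` data and all function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def vectorize(text):
    """Turn ticket text into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(history):
    """Aggregate historical tickets into one word-count vector per resolving team."""
    centroids = defaultdict(Counter)
    for text, team in history:
        centroids[team].update(vectorize(text))
    return dict(centroids)


def route(text, centroids):
    """Assign a new ticket to the team whose past tickets it most resembles."""
    vec = vectorize(text)
    return max(centroids, key=lambda team: cosine(vec, centroids[team]))


# Hypothetical history: (ticket text, team that finally resolved it).
HISTORY = [
    ("checkout page payment failed with card error", "payments"),
    ("refund not processed for order", "payments"),
    ("warehouse stock count mismatch in inventory report", "inventory"),
    ("inventory sync job failed overnight", "inventory"),
]

centroids = train(HISTORY)
print(route("payment card declined at checkout", centroids))
```

Training on the team that finally resolved each ticket, rather than the team it was first sent to, is what lets past rerouting data improve first-time allocation.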

Triaging: This step in the resolution process takes the most time and effort in IT operations. AI/ML is helping operators triage incidents faster in three ways: conversational UI-driven intelligent KeDBs (known error databases) that enable semantic searches, advances in observability that provide a 360-degree view of the state of dependent systems or actors during the incident, and suggestions of possible remediations based on semantic patterns.

Remediating: Notification during triaging would most likely lead to suggestions on remediation, as described above. In mature cases, the prescribed remediations accepted by the operator are also monitored to eventually enable straight-through remediation or self-healing. This is still quite rare in the application operations space, where SOPs are hard to come by for incidents.

Prevention

So far, we have been exploring how AI models and ML can help in the resolution of an incident. But how do we prevent incidents before they can even occur?

Preemptive resolution of possible incidents is perhaps one of the most ambitious applications of AI models in IT operations. Achieving it depends on learning models that can identify the strongest indicators and causes of incident risk, along with the degree of threat. When it comes to preventing incidents, AI and ML can be used to model and predict system behavior based on a range of parameters that we can analyze.

At HCL, we use three distinct models to predict systems behavior depending on the level of maturity of the available data. These are:

  • Probability distribution which focuses on internal two-dimensional data
  • Topological data analysis which focuses on internal multi-dimensional data
  • Game theory which focuses on both internal as well as external multi-dimensional data

These system behavior models leverage historical data to predict whether a problem could occur in a particular system. This prediction, in turn, can alert teams to take proactive measures or corrective actions in a dynamic scenario. These actions could include scaling infrastructure, changing the load balancing configuration, or simply introducing added layers of monitoring to prevent issues from occurring at all.
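As a simple illustration of prediction from historical data, the sketch below flags a metric reading that deviates sharply from its past distribution. This is one possible reading of a probability-distribution approach, not a description of HCL's models: it assumes a single metric, a z-score test, and a hypothetical threshold of three standard deviations.

```python
from statistics import mean, stdev


def is_at_risk(history, latest, threshold=3.0):
    """Flag a reading that lies more than `threshold` standard deviations
    from the mean of its history.

    `threshold` is an assumed tuning parameter, not a recommended value.
    """
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least 2 points
    if sigma == 0:
        # A perfectly flat history: any change at all is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical response-time samples in milliseconds.
response_times = [100, 102, 98, 101, 99]
print(is_at_risk(response_times, 110))  # far outside the usual range
print(is_at_risk(response_times, 101))  # within the usual range
```

A positive result would be the trigger for the proactive measures described above, such as scaling infrastructure or adding monitoring.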

When it comes to changes to an existing system, the operations team uses AI/ML to assess system behavior objectively before signing off on a release, all in an automated way. Two such techniques used recently are mutation testing and resilience engineering (chaos engineering).

Alerts

In contrast to traditional IT systems monitoring, AI and ML can be used to observe a system from a business-down perspective. AI/ML is used to correlate events from various monitoring tools and infer the behavior of a business capability or sub-process. Intelligent alert aggregation reduces the number of alert tickets. It also helps identify the real source of an alert, thereby reducing discovery, triage, and remediation time for such alerts. Another outcome of this approach is the elimination of unforced errors in ticket prioritization and allocation. This, in turn, saves costs and allows the operations teams to focus on areas that need more immediate attention.
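The aggregation step can be sketched as follows. This is a minimal example under stated assumptions: each alert is a dictionary with a timestamp in seconds (`ts`), the monitoring tool that raised it (`source`), and the business capability it affects (`capability`); alerts for the same capability that arrive within a hypothetical five-minute window are folded into one ticket. The field names and the window size are invented for illustration.

```python
def aggregate_alerts(alerts, window_seconds=300):
    """Group alerts that hit the same business capability within a time
    window into a single ticket, regardless of which tool raised them."""
    tickets = []
    latest_ticket = {}  # capability -> most recent ticket for that capability
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        capability = alert["capability"]
        ticket = latest_ticket.get(capability)
        if ticket and alert["ts"] - ticket["last_ts"] <= window_seconds:
            # Same capability, close in time: treat as the same underlying issue.
            ticket["alerts"].append(alert)
            ticket["last_ts"] = alert["ts"]
        else:
            ticket = {"capability": capability, "last_ts": alert["ts"], "alerts": [alert]}
            latest_ticket[capability] = ticket
            tickets.append(ticket)
    return tickets


# Hypothetical alerts from different monitoring tools.
ALERTS = [
    {"ts": 0, "source": "apm", "capability": "checkout"},
    {"ts": 60, "source": "db_monitor", "capability": "checkout"},
    {"ts": 120, "source": "apm", "capability": "search"},
    {"ts": 1000, "source": "log_monitor", "capability": "checkout"},
]

for ticket in aggregate_alerts(ALERTS):
    print(ticket["capability"], len(ticket["alerts"]))
```

Here four raw alerts collapse into three tickets, and the first ticket shows at a glance that two different tools are reporting on the same capability.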

Conclusion

From detecting anomalies to suggesting ways to remediate them, AI and machine learning models that analyze data patterns in systems have shown the potential to streamline every phase of operations and development. They find their place in most DevOps and SRE implementations. But as has become evident in my experience, the value of any technology is only as good as its implementation. That will continue to be the key differentiator for effective AI and ML adoption in an enterprise.

By understanding the underlying datasets and adopting appropriate AI/ML models, we can realize benefits of at least a 55% reduction in tickets, a 45% reduction in operators, and a 70% improvement in NPS for the IT operations team.