How machine learning strengthens incident management

Posted on: Oct 13, 2021

As system failures pile up, machine learning stands as an alternative way to improve response quality and save money. Learn the benefits and drawbacks of the approach.

Digitization, AI and machine learning lead to complex autonomous and adaptive systems that operate with little to no human intervention. While such systems are potent drivers of business growth, they are incredibly challenging for IT and DevOps teams to debug and diagnose when infrastructure or applications fail.

The financial ramifications of system failures have also multiplied, as these "smart," data-driven applications are now central to business operations. Thus, it is understandable, yet paradoxical, that managing and debugging modern IT infrastructure increasingly requires machine learning (ML) to identify, diagnose, fix and prevent problems.

Automated incident management and AIOps

Machine learning for incident management is a subset of AIOps, a process in which AI is applied to a wide array of IT operations tasks.

Many of those tasks fall under event correlation, analysis and incident management. Here, data analytics and ML modeling, applied to an aggregated repository of system, security and application data, can significantly reduce the time required to diagnose and fix problems. Furthermore, by combining encapsulated subject matter expertise with powerful mathematical techniques, ML-augmented IT support software improves the quality of incident response and makes these systems usable by less experienced IT support professionals.
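To make event correlation over an aggregated data repository more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `Event` record, the `correlate_by_time` helper, the five-minute window and the sample events are all hypothetical and are not taken from any particular AIOps product, which would typically use far richer signals than timestamps alone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    """Hypothetical record in an aggregated event repository."""
    timestamp: datetime
    source: str   # e.g. "config", "network", "application", "security"
    message: str


def correlate_by_time(events, window=timedelta(minutes=5)):
    """Group events into candidate incidents.

    An event joins the current group if it occurred within `window`
    of the previous event in that group; otherwise it starts a new group.
    """
    groups = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if groups and event.timestamp - groups[-1][-1].timestamp <= window:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


# Illustrative sample data (assumed, not real logs).
events = [
    Event(datetime(2021, 10, 13, 9, 2), "network", "packet loss on core switch"),
    Event(datetime(2021, 10, 13, 9, 0), "config", "routing table updated"),
    Event(datetime(2021, 10, 13, 14, 30), "security", "failed login burst"),
    Event(datetime(2021, 10, 13, 9, 4), "application", "checkout API timeouts"),
]

incidents = correlate_by_time(events)
for incident in incidents:
    print([f"{e.source}: {e.message}" for e in incident])
```

In this sketch the configuration change, the network fault and the application timeouts land in one candidate incident because they occur minutes apart, while the unrelated security event forms its own group. Time proximity is only a crude stand-in for the statistical and ML-based correlation the article describes.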

Problems with modeling data

The wide variety of causes for a service or application outage requires distinct approaches. Causes include configuration changes, software updates or patches, equipment failures, external network congestion -- for applications that rely on cloud services -- or malicious attacks, such as distributed denial-of-service attacks, data corruption or system compromises.

Various approaches to these scenarios typically fall into a few categories:

Data clustering and correlation. Associating similar events and linking cause to effect. For example, tying a configuration change to a network outage caused by improper routing information.

Anomaly detection. Detecting any unexpected divergences from the continuity of data streams or normal patterns.

Data fitting and prediction. Using both traditional and more advanced statistical methods.

Deep learning. Training neural networks on known, categorized data and using the trained models to analyze new incoming data.
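As one illustration of the anomaly detection category above, the following Python sketch flags points in a metric stream that diverge sharply from their recent history using a trailing-window z-score. This is a deliberately simple statistical stand-in, not the method used by any specific tool; the function name `find_anomalies`, the window size, the threshold and the latency values are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices of values that deviate from the mean of the
    preceding `window` values by more than `threshold` standard deviations.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat window gives no scale to score against; skip it.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response-time samples in milliseconds, with one spike.
latency_ms = [100, 102, 98, 101, 99, 100, 103, 450, 101, 99]

anomalies = find_anomalies(latency_ms)
print(anomalies)
```

Note that once the spike enters the trailing window it inflates the standard deviation, so the points right after it are not flagged. Production systems usually address this with more robust statistics or learned baselines.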