TL;DR: MTTD is the new MTTR

Part two in a multi-part series

How downstream incident prediction and root cause analysis prevent upstream service outages

Eliminating downtime is no longer a secondary priority. It is every bit as essential as providing the right features. Competing digital services are only one outage apart. Even slow load times or partial outages can cost millions per minute for businesses that rely on digital-first relationships with customers.


As a result, everyone in IT is now expected to understand and manage operations. MTTD (mean time to detect) is the new MTTR (mean time to resolve). The best service experience is the one that is never interrupted, which means traditional user-facing service management, governed by the principles of ITSM (IT service management), is now ceding focus to infrastructure-facing service management, governed by the principles of ITOM (IT operations management).
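
To make the two metrics concrete, here is a minimal sketch of how they could be computed from incident records. The timestamps and field names are hypothetical, and the sketch assumes both metrics are measured from the moment the fault began; some teams instead measure MTTR from the moment of detection.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records (illustrative values, not real data):
# when the fault began, when it was detected, and when it was resolved.
incidents = [
    {
        "started": datetime(2024, 1, 5, 14, 0),
        "detected": datetime(2024, 1, 5, 14, 20),
        "resolved": datetime(2024, 1, 5, 15, 0),
    },
    {
        "started": datetime(2024, 1, 9, 9, 0),
        "detected": datetime(2024, 1, 9, 9, 40),
        "resolved": datetime(2024, 1, 9, 10, 40),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


# Assumption: both metrics are measured from the start of the fault.
mttd = mean_minutes(incidents, "started", "detected")
mttr = mean_minutes(incidents, "started", "resolved")

print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")
```

Under this definition, detection time is a component of resolution time, so every minute removed from MTTD is also removed from MTTR. Detecting a problem before it becomes user-facing removes the incident from the user's point of view altogether.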


This shift is driving service management teams to embrace operations as a core discipline. The rise of AIOps as an area of expertise is the result of service delivery managers spending increasing portions of their time monitoring infrastructure performance and availability. The leading ITSM vendor, ServiceNow, is experiencing phenomenal growth in automation-related disciplines to help customers achieve their vision of near-zero downtime.


Just as incident and problem management define user-facing service delivery strategies, incident prediction and root cause analysis define infrastructure-facing service delivery strategies. Thankfully, the leader in incident prediction and automated root cause analysis, InsightFinder, has partnered with ServiceNow to ensure availability-related issues are detected upstream before users are impacted downstream.


The partnership between ServiceNow and InsightFinder combines the strengths of ServiceNow in workflow automation, discovery, CMDB, and service mapping with the strengths of InsightFinder in anomaly detection, stream processing, and unsupervised machine learning. Together, they feed automated insights into incident and problem management workflows. The integration works as follows:


  1. Unsupervised machine learning algorithms in InsightFinder detect anomalies across logs, metrics, traces, and change events.
  2. Detected anomalies are used to predict future incidents with enough lead time for operators to investigate in InsightFinder before users are impacted.
  3. To facilitate investigation, probable root cause analyses are used to determine the likely source of future incidents.
  4. Probable root causes are used to isolate the CIs and services likely to be impacted, based on service maps maintained by ServiceNow network discovery, which feeds relationship data to the CMDB.
  5. Once infrastructure incidents are service-aware, service owners are notified by ITSM workflows in ServiceNow so status updates can be provided to stakeholders.
  6. Incidents and problems are triaged and integrated with change management processes in ServiceNow to ensure approved actions are taken proactively and archived to build change risk profiles.
  7. Change tasks in ServiceNow are used as inputs to InsightFinder machine learning models to improve the accuracy of anomaly detection and identify problems that are caused by change events before the business is impacted.
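
Steps 1 and 4 above can be illustrated with a small, self-contained sketch. It is not InsightFinder's algorithm and does not call any InsightFinder or ServiceNow API. The anomaly check is a simple z-score against a recent baseline, standing in for the unsupervised models, and the service map is a hypothetical dictionary standing in for CMDB relationship data. All names and values are assumptions made for illustration.

```python
from collections import deque
from statistics import mean, stdev


def is_anomalous(baseline, value, threshold=3.0):
    """Flag `value` if it lies more than `threshold` standard deviations
    from the mean of the baseline samples (a simple z-score stand-in)."""
    center = mean(baseline)
    spread = stdev(baseline)
    if spread == 0:
        return value != center
    return abs(value - center) / spread > threshold


def impacted_services(service_map, root_cause_ci):
    """Breadth-first walk of the map to list everything that depends,
    directly or indirectly, on the probable root cause CI."""
    seen = {root_cause_ci}
    queue = deque([root_cause_ci])
    impacted = []
    while queue:
        current = queue.popleft()
        for dependent in service_map.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                impacted.append(dependent)
                queue.append(dependent)
    return impacted


# Hypothetical service map: each CI lists the CIs and services that depend on it.
service_map = {
    "db-01": ["payments-api", "orders-api"],
    "payments-api": ["mobile-checkout"],
    "orders-api": ["web-storefront"],
    "mobile-checkout": [],
    "web-storefront": [],
}

# Hypothetical query latency samples (ms) for db-01, then a new reading.
baseline_latency_ms = [120, 118, 125, 122, 119, 121]
latest_latency_ms = 310

if is_anomalous(baseline_latency_ms, latest_latency_ms):
    for service in impacted_services(service_map, "db-01"):
        # In the integration described above, this is where service owners
        # would be notified through ITSM workflows.
        print(f"Potential impact: {service}")
```

The traversal is what makes an infrastructure anomaly "service-aware": a single anomalous CI expands into the list of services whose owners need to know about it.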


This cycle of InsightFinder AIOps feeding ServiceNow ITSM yields continuous process improvement. Each iteration through the cycle combines the best of machine learning with human intelligence. Subsequent predicted incidents include more detailed historical records about what happened and what resolved it. Months into integrating AIOps with ITSM, typical customers report that previously common outage patterns have been completely eliminated. Those customers convert downtime reduction into business benefits, including reduced customer churn, reduced SLA penalties, and increased spend per customer.


Service owners once relegated to front-office user support are being empowered to perform like operators, thanks to insights about future incidents and probable root causes that previously required large teams of NOC operators and days of rigorous analysis.


The future of IT operations is a combination of automated anomaly detection using unsupervised machine learning plus automated service mapping and problem management. That future is available today from InsightFinder and ServiceNow.


To learn more about the power of InsightFinder and ServiceNow, request a demo to speak to our team.
