In traditional ITOps organizations, only a few metrics matter: amount of downtime, revenue loss, and, of course, mean time to repair (MTTR). Why would reducing MTTR no longer matter? There are two possibilities: we eliminate the problems to begin with (the ideal state) or, more likely, we detect and resolve each problem before it causes the system to go down.
Mean time to repair starts with responding to a negative incident: there is a problem that needs to be repaired. It is an inherently reactive process. How many false alerts came across before the system went down? How many hours on the bridge call? How long until you truly find the root cause of the problem?
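To make the metric concrete, here is a minimal sketch of how MTTR might be computed from incident records. The `mean_time_to_repair` function, the record layout, and the sample timestamps are all hypothetical and chosen only for illustration. Note that the clock starts only once an incident has been detected, which is exactly why the metric is reactive.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average repair duration over (detected_at, resolved_at) pairs.

    The record layout is an assumption for illustration; real incident
    tooling will store these timestamps in its own format.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the individual repair durations, starting from a zero timedelta.
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents lasting 2 hours, 1 hour, and 3 hours.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 11, 0)),
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 15, 0)),
    (datetime(2024, 2, 1, 22, 0), datetime(2024, 2, 2, 1, 0)),
]

mttr = mean_time_to_repair(incidents)
print(mttr)  # 2:00:00
```

Every incident in this list has already happened, so improving the average only shortens the damage; it does not prevent it.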
There are many tools on the market, and yet outages persist and their financial impact continues. The complexity of today's environments requires more than a few monitoring or observability tools.
An outage not only inconveniences customers, but also takes many people and teams out of the important work of building the business. Multiple individuals must try to find out where the problem is coming from, and ultimately what the root cause of the issue is. This involves many work hours to fix the problem and return to stability. After the incident is over, the team must manage the problem and determine how to best avoid it in the future.
The only way to eliminate MTTR is to be alerted to a potential problem before it happens. This involves detecting anomalies quickly and accurately, then quickly finding their root cause. In this case, the incident is not identified but predicted. Because the root cause is known in advance, it can be fixed before the problem occurs.
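As a rough illustration of what detecting an anomaly can mean in practice, the sketch below flags metric samples that deviate sharply from a rolling baseline using a z-score. This is one simple, generic technique and not a description of any particular product. The `find_anomalies` function, the window size, the threshold, and the latency values are assumptions made up for the example.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes of samples far outside the preceding rolling window.

    A sample is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` samples.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline gives no scale to measure against.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds, with one spike at index 8.
latency_ms = [100, 102, 98, 101, 99, 100, 103, 97, 180, 101]

print(find_anomalies(latency_ms))  # [8]
```

A production system would likely need far more than this: seasonality, correlated signals across services, and a way to suppress false alerts. The point of the sketch is only that the warning comes from the metric drifting, before anything has gone down.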