Multivariate behavior learning – the key for observability and monitoring of distributed systems
In the observability and monitoring space, anomaly detection is a key capability for every solution. But for IT Operations teams, problem detection is the (much more difficult) place where the value is created. Multivariate analysis unlocks the ability to manage complexity by taking into account how all the different software and hardware components that make up modern operational systems interact, so teams can predict and prevent system outages before they occur.
IT infrastructure is the backbone of nearly every enterprise company. Regardless of whether they are processing credit card payments, building tractors, or scheduling food deliveries, today’s corporations depend on reliable online systems to function. System availability, performance, and efficiency are critical.
A typical enterprise IT application system may include cloud and on-prem systems that span multiple data centers and incorporate multiple software and hardware technologies, all of which generate infrastructure metrics, transaction logs, and system traces.
Definitions – what are anomalies?
In distributed systems, anomalies are deviations from normal operational system behavior. They cover a wide range of events in which a given metric rises above or falls below its normal range. Examples include traffic spikes, throughput drops, API response time surges, CPU/RAM usage increases, Kubernetes pod count increases, error rate increases, traffic imbalances between data centers, and disk space usage increases.
The danger with identifying anomalies based on exceeding pre-defined thresholds is that
- these things happen frequently, and
- threshold overages often don’t impact system performance.
The monitoring team gets buried in false alarms – most of which don’t signal an actual problem. What to do?
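To make the false-alarm problem concrete, here is a minimal sketch of fixed-threshold alerting. The sample values, the 90% limit, and the function name `threshold_alerts` are all hypothetical and chosen only for illustration.

```python
def threshold_alerts(samples, limit):
    """Return the indices of samples that exceed a fixed limit."""
    return [i for i, value in enumerate(samples) if value > limit]


# Hypothetical CPU utilization (%) sampled once a minute.
# The short bursts above 90% are routine and do not affect users.
cpu_percent = [42, 55, 91, 48, 93, 50, 95, 47, 92, 51]

alerts = threshold_alerts(cpu_percent, limit=90)
print(f"{len(alerts)} alerts in {len(cpu_percent)} minutes, at indices {alerts}")
```

Every brief burst fires its own alert, even though none of them signals a real problem. Raising the limit only trades false alarms for missed ones.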
Definitions – what’s a problem?
Problems are defined by ITSM simply as recurring incidents or major incidents that affect key system operations. These include things like service availability dropping, a sudden workload backlog due to a software bug, or a significant performance slowdown. All of these start out as anomalies, but they impact performance, user experience, and business brand.
Making the move from Anomaly Detection to Problem Detection
Often problems are the result of multiple failures, or of interactions between systems that might not otherwise show up as anomalies. InsightFinder moves beyond single-metric anomaly detection by applying unsupervised multivariate machine learning to massively distributed systems. InsightFinder was built on research that led to patents on predicting problems within distributed systems.
Instead of just looking at individual measurements, InsightFinder takes a multivariate approach: rather than seeing only a spike in one system, it looks at how all the systems interact.
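The difference between single-metric and multivariate checks can be illustrated with a small sketch. It uses the Mahalanobis distance, a textbook technique chosen here purely for illustration; it is not a description of InsightFinder's actual algorithm. The metric pairs (requests per second, CPU %), the data, and the function names `fit` and `mahalanobis` are hypothetical.

```python
from statistics import mean


def fit(points):
    """Estimate the mean and 2x2 sample covariance of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis(point, model):
    """Distance of a point from the mean, scaled by the joint covariance."""
    (mx, my), (sxx, sxy, syy) = model
    dx, dy = point[0] - mx, point[1] - my
    det = sxx * syy - sxy ** 2  # determinant of the covariance matrix
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return squared ** 0.5


# Hypothetical history: CPU % normally tracks requests per second.
history = [(100, 21), (120, 23), (140, 29), (160, 31),
           (180, 37), (200, 39), (220, 45), (240, 47)]
model = fit(history)

typical = (150, 30)  # moderate traffic, moderate CPU
odd = (120, 44)      # low traffic but high CPU

for label, point in (("typical", typical), ("odd", odd)):
    print(label, round(mahalanobis(point, model), 2))
```

Both values in the `odd` sample sit inside the ranges seen in the history, so a per-metric threshold would pass it. Only the joint view shows that high CPU at low traffic breaks the usual relationship between the two metrics.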
By learning what “normal” looks like in the system being monitored as a whole, InsightFinder recognizes regular activities. For example, if a web server always has a CPU spike at 9 AM on Mondays, that spike isn’t an anomaly.
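One simple way to picture this kind of learned baseline is to keep separate statistics for each weekday-and-hour slot. The sketch below is an assumption-laden illustration, not InsightFinder's method; the data, the z-score limit of 3, and the names `build_baseline` and `is_anomalous` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_baseline(history):
    """Map each (weekday, hour) slot to the (mean, stdev) of its values."""
    buckets = defaultdict(list)
    for weekday, hour, value in history:
        buckets[(weekday, hour)].append(value)
    # stdev needs at least two samples, so skip sparser slots.
    return {slot: (mean(vals), stdev(vals))
            for slot, vals in buckets.items() if len(vals) >= 2}


def is_anomalous(baseline, weekday, hour, value, z_limit=3.0):
    """Flag a value that is far from what is usual for its time slot."""
    mu, sigma = baseline[(weekday, hour)]
    sigma = max(sigma, 1.0)  # floor so a very steady slot is not over-sensitive
    return abs(value - mu) / sigma > z_limit


# Hypothetical CPU % from four past Mondays (weekday 0).
history = [(0, 9, 88), (0, 9, 92), (0, 9, 90), (0, 9, 91),
           (0, 14, 40), (0, 14, 42), (0, 14, 38), (0, 14, 41)]
baseline = build_baseline(history)

print(is_anomalous(baseline, 0, 9, 91))   # usual Monday 9 AM spike
print(is_anomalous(baseline, 0, 14, 91))  # same value at 2 PM
```

The same 91% reading is treated as routine at 9 AM on a Monday and as unusual at 2 PM, because each slot is compared only against its own history.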
Additionally, one of InsightFinder’s strengths is root cause analysis. Because it understands what “normal” looks like, and it looks at the performance of all the component systems, InsightFinder’s root cause analysis capability gets to the heart of complex system problems.
To learn more about InsightFinder, try it yourself using the free InsightFinder Sandbox. Or schedule a consultation with an InsightFinder expert to see how multivariate analysis of operational system data can impact your system performance.