Scenario

Site reliability engineers (SREs) at a major search engine company were paged at midnight because their core internal database service was down, as indicated by the query failure rate key performance indicator (KPI). The SREs could see only that production queries were failing; there was no clear indication why. One SRE guessed at the cause and fixed a bug that turned out to be unrelated to the problem, and ultimately had to page a developer to diagnose the issue. The database service was down for over six hours, and because so many products depended on it, the outage affected a large number of them.

Post-mortem

Hours after the problem was first discovered, with a significant portion of queries still failing, an insightful SRE decided to manually check other application-related metrics and found a strong anti-correlation between the application connection pool’s “Connections Available” metric and Query Failures: as Query Failures increased, Connections Available approached, or flat-lined at, zero.
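As a rough illustration of the check the SRE performed by hand, the sketch below computes the correlation between the two metrics. It assumes the metrics have been exported as aligned, timestamped samples; the file and column names are hypothetical, not the customer’s actual telemetry schema.

```python
import pandas as pd

# Hypothetical export of the two metrics as aligned, timestamped samples.
metrics = pd.read_csv(
    "db_metrics.csv", parse_dates=["timestamp"], index_col="timestamp"
)

# Pearson correlation between the KPI and the pool metric. A value near -1
# reflects the strong anti-correlation the SRE observed: query failures rise
# as available connections approach zero.
corr = metrics["query_failures"].corr(metrics["connections_available"])
print(f"corr(query_failures, connections_available) = {corr:.2f}")
```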

With this additional information, the team identified a software bug: when a workload executing a specific type of query ran, connections were requested from the pool but never released back to it. This starved the connection pool and, in turn, caused queries to fail for lack of available connections. The production service was recovered by disabling the workloads that issued that type of query.
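The failure mode is the classic connection-leak pattern. The toy Python sketch below (hypothetical pool and query handling, not the customer’s code) shows how omitting the release on a single code path steadily drains a fixed-size pool until new requests fail:

```python
from queue import Queue, Empty

class ConnectionPool:
    """Toy fixed-size pool; a real pool starves the same way under a leak."""

    def __init__(self, size=10):
        self._free = Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout=1.0):
        try:
            return self._free.get(timeout=timeout)
        except Empty:
            # Once every connection has leaked, callers end up here and fail.
            raise RuntimeError("no connections available")

    def release(self, conn):
        self._free.put(conn)

pool = ConnectionPool(size=10)

def run_query(query):
    conn = pool.acquire()
    try:
        ...  # execute the query on conn
    finally:
        # The buggy workload skipped this step for one query type, so each
        # such query permanently removed a connection from the pool.
        pool.release(conn)
```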

Our Solution

InsightFinder was being brought into the environment as this problem occurred. Once installed, SREs used InsightFinder’s Metric File Replay agent to quickly populate the performance history of the environment. Almost immediately, InsightFinder identified that the Query Failures KPI had a strong causal relationship with the Available Connections metric. With notifications enabled, SREs would have received an initial notification indicating not only that Query Failures were occurring, but also that the Available Connections metric was trending anomalously hours before the KPI violation. They would also have received a web link to a report showing a summary analysis of the anomaly event and a chart of both the Query Failures and Available Connections metrics. Armed with this information, SREs would have received early warning and could have immediately isolated the problem to connection use within the application, avoiding the core service outage.
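For illustration only, and not InsightFinder’s implementation, a leading-indicator check of this kind can be sketched as flagging an anomalously steep decline in the pool metric before the failure-rate KPI ever breaches its threshold; the column name, window, and threshold below are assumptions:

```python
import pandas as pd

def early_warning(metrics: pd.DataFrame,
                  window: int = 30,
                  z_threshold: float = -3.0) -> pd.Series:
    """True wherever connections_available drops far faster than its own
    recent history suggests, which precedes the query-failure KPI violation."""
    delta = metrics["connections_available"].diff()
    baseline_mean = delta.rolling(window).mean()
    baseline_std = delta.rolling(window).std()
    z = (delta - baseline_mean) / baseline_std
    return z < z_threshold

# warnings = early_warning(metrics)  # 'metrics' as in the earlier sketch
```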

Results

This regrettable situation proved educational for the customer and motivating for the InsightFinder team. The SREs quickly recognized that their KPI could serve only as the “canary in a coal mine”: it indicates that a problem exists, but it doesn’t yield an actionable operations plan to address it. They also recognized that while KPI-only monitoring was insufficient, no manual approach to analyzing the broader set of metrics was viable either, given the data volumes, the varying workloads, and the maintenance burden such a solution would carry.

InsightFinder proved not only capable of the “heavy-lifting” analysis in this case, but also to be the right long-term solution for automated analysis of this company’s mountain of metrics. InsightFinder’s Causal Analysis and KPI Impact analysis visualizations are designed to help engineers quickly see the relationships between metrics, components, and their KPIs.

Explore how your team can benefit from InsightFinder by requesting a free trial or a free outage analysis.
