Scenario
A major credit card company’s mobile payment service experienced severe performance degradation on a Friday afternoon. Initial triage did not reveal any issues. The company’s senior performance architect investigated in detail and found that some DB2 cluster nodes showed high disk usage; however, it was unclear to him whether the high disk usage came from a typical customer workload surge or from unusual activity. He had to sit through several painful bridge calls with the networking, DBA, storage, and infrastructure teams. The outage was not fully resolved until the next day.
Post-mortem
This performance outage was caused by human error: a support engineer started a DB2 maintenance query at the wrong time. This triggered massive data movement operations in one of the customer’s DB2 clusters and severely degraded production performance. After several long hours of bridge calls, the company’s senior performance architect finally identified the root cause and restored the mobile payment system’s performance by stopping the maintenance query.
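The specific query involved in the incident is not documented here, but as an illustration only, a DB2 maintenance operation of this kind might resemble the following sketch, which uses the ibm_db Python driver to run a table reorganization through DB2’s ADMIN_CMD procedure. The connection string and table name are hypothetical.

```python
# Hypothetical sketch: a DB2 maintenance operation that triggers
# heavy data movement. The connection details and table name are
# invented for illustration; they are not from the actual incident.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=PAYDB;HOSTNAME=db2-node1;PORT=50000;PROTOCOL=TCPIP;"
    "UID=dbadmin;PWD=secret;",
    "", ""
)

# REORG physically rewrites table data, which can generate large
# bursts of disk I/O -- exactly the kind of load that should be
# scheduled outside peak production hours.
ibm_db.exec_immediate(
    conn, "CALL SYSPROC.ADMIN_CMD('REORG TABLE APP.TRANSACTIONS')"
)

ibm_db.close(conn)
```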
InsightFinder’s Solution
InsightFinder was brought into the customer’s environment as this problem occurred. The system operator used InsightFinder’s Metric File Replay and Log File Replay agents to quickly populate disk, network, and CPU usage metrics and system logs for all the production nodes, which spanned two DB2 clusters and one GPFS system.
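While the agents’ exact interface is not shown here, the replay workflow is conceptually simple: read historical metric samples from files and forward them, with their original timestamps, to the analysis backend. The sketch below illustrates that idea under assumed inputs; the ingestion URL, CSV layout, and field names are hypothetical, not InsightFinder’s actual API.

```python
# Conceptual sketch of metric file replay: historical samples are
# re-sent with their original timestamps so the backend can analyze
# the incident window after the fact. The endpoint URL and CSV
# layout are hypothetical.
import csv
import json
import urllib.request

INGEST_URL = "https://collector.example.com/api/v1/metrics"  # hypothetical

def replay_metric_file(path: str, node: str) -> None:
    """Read rows of (timestamp, metric, value) and forward each one."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            payload = {
                "node": node,
                "timestamp": int(row["timestamp"]),  # original event time
                "metric": row["metric"],             # e.g. disk_used_pct
                "value": float(row["value"]),
            }
            req = urllib.request.Request(
                INGEST_URL,
                data=json.dumps(payload).encode(),
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req)

# Replay disk/network/CPU metrics for one of the DB2 cluster nodes.
replay_metric_file("db2-node1-metrics.csv", node="db2-node1")
```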
InsightFinder immediately revealed anomalies in the disk usage of the affected DB2 cluster nodes. By analyzing the DB2 logs, InsightFinder identified a set of new log patterns on the nodes with high disk usage. These were information-level log messages that the customer’s existing alerting tools had not captured, and they indicated that a large number of data movement operations had been triggered several minutes before the disk usage surge. InsightFinder provides comprehensive causal analysis across all collected metric and log data.
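The core idea behind catching “new pattern” logs can be sketched simply: mask variable fields so that messages collapse into templates, flag templates that never appeared in a baseline window, and note how far their onset precedes the metric surge. The sketch below is a generic illustration under those assumptions, with invented sample messages; it is not InsightFinder’s actual algorithm.

```python
# Generic illustration of new-log-pattern detection: mask variable
# tokens so messages collapse into templates, then flag templates
# absent from the baseline window. Sample logs are invented.
import re

def template_of(message: str) -> str:
    """Collapse numbers and hex ids into placeholders."""
    masked = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", message)
    return re.sub(r"\d+", "<NUM>", masked)

def new_patterns(baseline_logs, incident_logs):
    """Return {template: first_seen_ts} for patterns unseen in the baseline."""
    known = {template_of(msg) for _, msg in baseline_logs}
    first_seen = {}
    for ts, msg in incident_logs:
        tpl = template_of(msg)
        if tpl not in known and tpl not in first_seen:
            first_seen[tpl] = ts
    return first_seen

baseline = [(100, "INFO tablespace TS1 state normal")]
incident = [
    (900, "INFO moving 4096 extents from container 7"),  # new pattern
    (905, "INFO moving 8192 extents from container 9"),  # same template
]
disk_surge_ts = 1200  # when disk usage spiked

for tpl, ts in new_patterns(baseline, incident).items():
    lead = disk_surge_ts - ts
    print(f"new pattern {lead}s before surge: {tpl}")
```

Here the data-movement messages collapse into one template whose first occurrence precedes the disk surge, mirroring the lead-time relationship described above.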
The customer could see that the root cause of the high disk usage was the unusual data movement operations. Armed with the information provided by InsightFinder, a first-line support engineer would have been able to correct the performance degradation by stopping the maintenance queries, avoiding many hours of degraded production performance.
Results
This proof of concept showed the senior performance architect the power of InsightFinder’s automatic anomaly detection and root cause analysis. The insights provided by InsightFinder help customers like this one avoid significant financial losses and brand damage, and instead deliver consistent, reliable service to their users.
Learn more about InsightFinder Root Cause Analysis.
Learn more about InsightFinder with a free trial or live demo.