Scenario

A major customer-facing service from a cloud monitoring company was down for three days. Customers complained that they could not load the dashboard, and the entire engineering team was pulled into the debugging effort. The developers believed the outage was caused by data corruption in their Cassandra data store. The team tried to recover the Cassandra cluster but failed. Eventually, the company had to rebuild the cluster and migrate data across thousands of nodes, and the service was not fully restored for weeks.

Post-mortem

The catastrophic service outage actually started several months earlier, when several machines experienced partial or complete disk failures. The system administrator received node failure alerts but ignored them, reasoning that replacing the failing nodes was not urgent because Cassandra offers built-in fault tolerance via replication. Over time, more disks failed and the workload on the remaining healthy nodes kept climbing. To address the problem, the system administrator made a configuration change: he removed the failed nodes from the cluster and throttled the workload on the nodes that were experiencing partial disk failures. Unfortunately, this change set off a storm of data migration, pushing an already overloaded Cassandra cluster even further past its limits. About 48 hours later, the entire cluster was brought down by the resulting disk traffic.
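The post-mortem does not include cluster sizes, but a rough back-of-the-envelope sketch helps explain why removing many nodes at once can overwhelm a cluster that is already degraded. The Python snippet below uses purely hypothetical numbers (100 nodes, 500 GB per node, 15 nodes removed) and a simplified even-distribution assumption to estimate the extra streaming and request load that lands on each surviving node.

    # Back-of-the-envelope illustration with hypothetical numbers (not from the
    # post-mortem): why removing several Cassandra nodes at once can flood the
    # surviving nodes with streaming and request traffic.

    def removal_impact(total_nodes: int, removed_nodes: int, gb_per_node: float):
        """Estimate the extra work per surviving node when `removed_nodes` are
        taken out of a cluster of `total_nodes`, assuming evenly distributed data."""
        survivors = total_nodes - removed_nodes
        # Replicas held by the removed nodes must be re-streamed onto the
        # survivors to restore the replication factor.
        stream_gb_per_survivor = removed_nodes * gb_per_node / survivors
        # The removed nodes' share of read/write traffic also shifts to survivors.
        traffic_multiplier = total_nodes / survivors
        return stream_gb_per_survivor, traffic_multiplier

    if __name__ == "__main__":
        stream_gb, traffic_x = removal_impact(100, 15, 500)
        print(f"~{stream_gb:.0f} GB streamed onto each survivor, "
              f"~{traffic_x:.2f}x request load per survivor")
        # -> roughly 88 GB of extra streaming per node, plus ~1.18x request load,
        #    on nodes already degraded by partial disk failures.

Even under these optimistic assumptions, every surviving node has to absorb tens of gigabytes of streamed data while serving a larger share of live traffic, which is consistent with the migration storm described above.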

Our Solution

InsightFinder was being brought into the environment as this problem occurred. Once it was installed, SREs used InsightFinder’s Metric File Replay agent to quickly populate the environment’s disk metric history. Almost immediately, InsightFinder identified a set of disk usage anomalies that correlated closely with the configuration operation recorded in the post-mortem report. Those anomalies were missed by the customer’s existing monitoring tool and preceded the catastrophic failure by almost 48 hours. The anomaly alerts clearly indicated an unusual disk usage spike right after the system administrator made the configuration error, and the anomaly score reported by InsightFinder kept rising, reaching a very high level by the time the whole system was down. Armed with this information, the system administrator could have corrected the configuration error upon receiving the alerts and avoided the catastrophic service outage.
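InsightFinder’s detection models are proprietary, so the sketch below is only a minimal stand-in: a rolling-baseline check, written in Python against hypothetical replayed disk-usage samples, that flags points deviating sharply from recent behavior. It illustrates how an unusual post-configuration spike can surface well before the cluster actually fails.

    # Illustrative only -- not InsightFinder's algorithm. A minimal rolling-baseline
    # detector that flags disk-usage samples far outside recent behavior.
    from statistics import mean, stdev

    def flag_disk_anomalies(samples, window=60, threshold=3.0):
        """Return indices of samples deviating from the rolling baseline by more
        than `threshold` standard deviations."""
        anomalies = []
        for i in range(window, len(samples)):
            baseline = samples[i - window:i]
            mu, sigma = mean(baseline), stdev(baseline)
            if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
                anomalies.append(i)
        return anomalies

    if __name__ == "__main__":
        # Hypothetical replayed history: steady disk utilization, then a sharp
        # climb after a misconfigured migration starts streaming data.
        history = [40 + (i % 5) * 0.5 for i in range(120)]   # normal behavior
        history += [50 + 2 * i for i in range(30)]           # post-change spike
        print(flag_disk_anomalies(history))                   # flags the spike region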

Results

This painful service outage proved to be educational for the customer. The customer quickly recognized that they could not rely solely on a threshold-based alerting system, which was unable to detect the unexpected behavior caused by the misconfiguration. Furthermore, the system administrator recognized that an early anomaly alert could have saved the company from significant financial losses and brand damage.

InsightFinder proved capable of raising alerts before the system entered a catastrophic failure state and allowed the system administrator to quickly identify the configuration error.
