A global query service from a web search company experienced a major service outage. The service consists of 13 different replication zones all over the world for reliability. The service outage was detected only after all 13 zones experienced unavailability. The Software reliability engineer (SRE) had no clue where the service outage started and which zones were attributed to the service outage. The system administrator had to manually check all the zones to localize the problem. The whole diagnosis process lasted more than two days, which caused significant financial loss to the company.
The problem started from one zone (let us called it zone X) which had a broken network switch that experienced route flapping. The problem caused a query backlog in the zone X. The multi-zoned service has global load balancing to adjust workload among different zones. When the workload backlog happened on zone X, the load balancing is supposed to shift the workload to other zones. However, this incident happened to trigger a software bug in the load balancing, which duplicated the workload from zone X and sent it to all the other zones instead of just shifting the workload to another zone. This eventually brought down the whole production service and their existing monitoring tool could only raise alerts and provided misleading information that all the zones were experiencing problems.
InsightFinder was being brought into the environment as this problem occurred. Once installed, SREs used InsightFinder’s Metric File Replay agent to quickly populate the operation history of the environment. Almost immediately, InsightFinder identified the culprit zone that experienced the high query error rate, correlating with a low network bandwidth metric, several days before the service outage happened. The system administrator would have received alerts from InsightFinder to localize the culprit zone and potentially prevent the whole service outage by fixing the switch flapping issue. Our multivariate anomaly detection accurately identified the culprit zone by analyzing all the zones together. In contrast, the customer’s existing tools either produced many false alarms or failed to detect the culprit zone anomalies.
This painful service outage proved to be educational for the customer. The customer quickly recognized that they could not rely solely on their univariate anomaly detection tools to localize the root cause in a geo-redundant service. Fast root cause identification, before a service outage, not only allows the system administrator to quickly resolve the issue but also provides the lead time necessary to prevent the disastrous service outage.
InsightFinder proved to be capable of identifying the “outlier” zone accurately before the system entered a catastrophic failure state and allowed the system administrator to fix the issue much more efficiently rather than searching for a needle in a haystack.