On Friday, July 19, 2024, CrowdStrike, a leading cybersecurity firm, experienced a widespread outage that affected its customers worldwide. The incident was caused by a software update that was rolled out at 4:09am UTC, and by 4:27am UTC InsightFinder, a leading AI-powered observability platform, had started detecting widespread machine crash incidents among our customers. In this article, we’ll explore how InsightFinder’s real-time monitoring and analytics capabilities helped our customers mitigate the impact of the outage and take proactive steps to minimize downtime.

The Incident:

At 4:09am UTC, CrowdStrike rolled out an update to its Falcon platform, which is used by thousands of organizations worldwide to detect and prevent cyber threats. However, the update contained an out-of-bounds memory read bug, which caused widespread Windows operating system crashes and affected the availability of critical services and applications on a global scale.

InsightFinder’s Detection:

InsightFinder Unified Intelligence Engine (UIE) detected the incident as early as 4:27am UTC, 18 minutes after the rollout and long before CrowdStrike officially acknowledged the issue. Our platform uses machine learning algorithms to analyze vast amounts of telemetry and log data from many sources, including system logs, network traffic, and performance metrics. This analysis allows us to identify anomalies and patterns that may indicate potential issues before they cause major service outages for our customers. In this case, InsightFinder UIE detected a large number of crash events, including PAGE_FAULT_IN_NONPAGED_AREA and SYSTEM_THREAD_EXCEPTION_NOT_HANDLED.
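To make the idea of spotting a surge in crash events concrete, here is a minimal Python sketch. It is not InsightFinder's algorithm, which this article does not describe in detail; it is a simple stand-in that counts crash events per fixed time window and flags windows far above the recent baseline. The event tuple format, the function names, and the `factor` and `min_count` thresholds are all hypothetical choices made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from statistics import median

# Bug check names mentioned in this article; everything else here is assumed.
CRASH_CODES = {
    "PAGE_FAULT_IN_NONPAGED_AREA",
    "SYSTEM_THREAD_EXCEPTION_NOT_HANDLED",
}


def crash_counts_per_window(events, window=timedelta(minutes=5)):
    """Count crash events per fixed time window.

    events: iterable of (timestamp, host, code) tuples, where timestamp is a
    timezone-aware datetime (hypothetical schema).
    Returns a Counter keyed by the UTC start time of each window.
    """
    size = int(window.total_seconds())
    counts = Counter()
    for ts, _host, code in events:
        if code in CRASH_CODES:
            start = int(ts.timestamp()) // size * size
            counts[datetime.fromtimestamp(start, tz=timezone.utc)] += 1
    return counts


def flag_spikes(counts, factor=5.0, min_count=10):
    """Return window start times whose crash count far exceeds the baseline.

    The baseline is the median count of earlier, unflagged windows. Windows
    with no crash events are absent from `counts`, so the baseline only
    reflects windows that had at least one crash.
    """
    flagged, history = [], []
    for start in sorted(counts):
        n = counts[start]
        baseline = median(history) if history else 0
        if n >= min_count and n > factor * max(baseline, 1):
            flagged.append(start)
        else:
            history.append(n)
    return flagged
```

Feeding this sketch a stream of crash events with a handful of background crashes followed by dozens within a few minutes would flag the window containing the surge. A production system would need to handle far more than this, such as quiet windows, per-host deduplication, and signals beyond raw counts.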

Benefits to Customers:

InsightFinder’s early detection of the incident provided our customers with a critical window of opportunity to take proactive steps to minimize the impact of the outage. With our real-time insights, customers were able to:

  1. Identify affected systems: Our customers were able to quickly identify which systems and applications were affected by the outage, allowing them to prioritize their response efforts.
  2. Communicate with stakeholders: Our customers were able to communicate with their stakeholders, including employees, customers, and partners, about the incident and the expected resolution time, reducing the risk of misinformation and panic.
  3. Reduce downtime: By taking proactive steps, our customers were able to reduce the downtime associated with the outage, minimizing the impact on their business operations and customer satisfaction.

Lenovo Director Coby Gurr highlights the benefits of leveraging InsightFinder’s platform: “Partnering with InsightFinder gives us an innovative edge in proactive insights and digital employee experience (DEX). Their technology enhances Lenovo Device Intelligence, ensuring our customers enjoy uninterrupted excellence and reliability.”

The CrowdStrike outage was a significant incident that highlighted the importance of real-time monitoring and analytics in IT operations. InsightFinder’s advanced unsupervised machine learning algorithms allowed our customers to detect the incident early, take proactive measures to mitigate its impact, and reduce downtime. By leveraging InsightFinder’s insights, our customers were able to minimize the disruption caused by the outage and maintain business continuity. As the IT landscape continues to evolve, it’s clear that real-time monitoring and analytics will play a critical role in ensuring the reliability and availability of critical systems and applications.

 
