The Future Of AI-First IT Operations And How To Design Your Monitoring Strategy
Parenting toddlers is not for the faint of heart. If we intervened every time danger seemed imminent, we’d seal them in Lysol-coated, hermetic plastic bubbles, and we’d react to dirt-eating the same way we react to kidnapping. Imagine the chaos that would cause at play dates.
Thankfully, we cultivate instincts by observing the world and understanding what pattern of events indicates real danger. For example, stranger + park + unattended toddler = problem. However, relatives + home + unattended toddler = (relative) safety. Hundreds of times a day, we do mental math as parents to determine a toddler danger (we’ll call it TD) score. When it exceeds a certain threshold, we act.
Web-scale systems need to be monitored like toddlers. One sampled metric showing high CPU usage doesn’t necessarily indicate danger. However, a thousand anomalous metrics and log lines from hosts associated with the same service over a one-minute period probably indicate a need to act.
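To make the idea concrete, here is a minimal sketch of that kind of thresholding: it counts anomalous events per service over a sliding one-minute window and signals only when the count crosses a threshold. The class name `ServiceAnomalyWindow`, its parameters, and the default of 1,000 events are hypothetical choices for illustration, taken loosely from the example above. It is not how any particular monitoring product works, and it assumes timestamps for a given service arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class ServiceAnomalyWindow:
    """Hypothetical helper: counts anomalous events per service in a sliding window."""

    def __init__(self, window_seconds=60.0, threshold=1000):
        self.window_seconds = window_seconds  # length of the sliding window
        self.threshold = threshold            # events needed before we act
        self._events = defaultdict(deque)     # service name -> recent timestamps

    def record(self, service, timestamp):
        """Record one anomalous metric or log line.

        Returns True when the service has reached the threshold within the
        window. Assumes timestamps (in seconds) are non-decreasing per service.
        """
        events = self._events[service]
        events.append(timestamp)

        # Drop events that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()

        return len(events) >= self.threshold
```

A single anomaly never trips the alert on its own; only a burst tied to the same service does. In practice the threshold would likely be tuned per service rather than fixed.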