Parenting toddlers is not for the faint of heart. If we intervened every time danger seemed imminent, we’d seal them in Lysol-coated, hermetic plastic bubbles and react to dirt-eating the same way we react to kidnapping. Imagine the chaos that would cause at play dates.
Thankfully, we cultivate instincts by observing the world and understanding what pattern of events indicates real danger. For example, stranger + park + unattended toddler = problem. However, relatives + home + unattended toddler = (relative) safety. Hundreds of times a day, we do mental math as parents to determine a toddler danger (we’ll call it TD) score. When it exceeds a certain threshold, we act.
Web-scale systems need to be monitored like toddlers. One sampled metric showing high CPU usage doesn’t necessarily indicate danger. However, a thousand anomalous metrics and log lines from hosts associated with the same service over a one-minute period probably indicate a need to act.
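
To make the idea concrete, here is a minimal sketch of that kind of rule in Python: count anomalous events per service over a sliding one-minute window and act only when the count crosses a threshold. The class name `AnomalyWindow`, the service names, and the default values are all hypothetical, chosen only to mirror the example above. The sketch also assumes events for a given service arrive in timestamp order.

```python
from collections import defaultdict, deque


class AnomalyWindow:
    """Counts anomalous events per service over a sliding time window.

    Hypothetical illustration: the defaults mirror the "thousand anomalies
    in one minute" example and are not taken from any real product.
    """

    def __init__(self, window_seconds=60, threshold=1000):
        self.window_seconds = window_seconds
        self.threshold = threshold
        # One queue of event timestamps per service name.
        self._events = defaultdict(deque)

    def record(self, service, timestamp):
        """Record one anomalous metric or log line.

        Returns True when the number of anomalies seen for this service
        within the window has reached the threshold.
        Assumes timestamps (in seconds) are non-decreasing per service.
        """
        events = self._events[service]
        events.append(timestamp)
        # Forget anything that has aged out of the window.
        cutoff = timestamp - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events) >= self.threshold


# Usage with a tiny threshold so the behavior is easy to follow.
window = AnomalyWindow(window_seconds=60, threshold=3)
first = window.record("checkout", 0)    # one anomaly: no action
second = window.record("checkout", 10)  # two anomalies: no action
third = window.record("checkout", 20)   # three within a minute: act
later = window.record("checkout", 100)  # earlier events have expired
other = window.record("search", 20)     # a different service has its own count
```

A single spike never trips the rule on its own; only a cluster of anomalies tied to the same service within the window does, which is the monitoring equivalent of the TD score crossing its threshold.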