It was one week before Kazzle launched its new product. The company held a dominant market share in the astronomy community and was known worldwide for its optical filters for telescopes. With Kazzle, views of space could be augmented with virtual worlds developed on a SaaS platform. The new product would extend the optical filter technology to underwater images taken with waterproof smartphones.

The new app performed well with 2,000 beta testers. Karl, head of SRE, nearly had a coronary when marketing told him to expect 200,000 daily active users at launch, growing 10% per month. Karl was aware of every vulnerability in the architecture: Apache hadn’t been tuned properly, storage capacity was under-scoped, verbose logs were causing CPU spikes, and Kubernetes wasn’t configured to auto-scale. Karl had worked at companies like Kazzle before, where sales and marketing assumed infrastructure would scale itself. In reality, the world of SRE leaders is less like a symphony, with every note rehearsed, and more like a hurricane: lightning and thunder, tsunami waves, uprooted trees.

Karl convened his team for daily standups in the week before launch. Even with a team of experts assigned to monitor every layer of the tech stack, he knew no one could anticipate everything that might go wrong with Kazzle’s infrastructure. Karl’s strategy was to rely on his team for triage and on a SaaS AIOps platform for monitoring and remediation. He needed a system that would detect anomalies in logs and metrics at cloud scale. He needed to automate the process of using those anomalies to predict when business-impacting incidents might occur. He needed a system that could alert team members to the probable root cause of incidents. He needed an unfair advantage.

Karl was appropriately paranoid. He had been at Kazzle for two years and was the only remaining member of the DevOps team he had joined. His manager, Radhika, had been quickly “reassigned” after a critical outage derailed last quarter’s product launch. Karl was savvy enough to evaluate AIOps platforms instead of spending his limited budget on headcount. He evaluated Splunk, New Relic, ServiceNow, and a few emerging AIOps tools before investing in InsightFinder. It was the only platform that met his requirements for anomaly detection, incident prediction, and root cause analysis, all integrated with his other tools and all able to operate as a SaaS platform at Kazzle’s scale.

The launch was a success. Miraculously, there were no outages or performance issues. Instead of being fired, Karl was promoted to VP. Most important, 220,000 new Kazzle customers can now share sepia-tinted coral reefs with mermaids in 3D on Instagram.
