In 2025, the reliability bar kept rising. Engineering teams shipped more distributed systems, but they also shipped more probabilistic behavior: AI features that adapt, depend on fast-moving data, and increasingly act through tools. Traditional observability still helps teams find where the fire is. The harder question is what changed, why it changed, and what the safest next step looks like when the system’s behavior is not deterministic.
That’s the backdrop for InsightFinder AI’s 2025 story. Over the last year, we sharpened our focus on a unified reliability platform that supports both enterprise IT services and agentic AI systems. Insights continue to matter, but teams win when insights reliably translate into safer operational actions.
Innovation in AI Reliability: Composite AI and Patented IP
Most “AI observability” narratives still over-index on surface-level automation. The real innovation shows up when your systems can preserve causal context across noisy signals, then turn that context into decisional guidance under pressure. That’s the problem we focused on in 2025, with foundational work that reinforces technical leadership rather than incremental features.
A core innovation pillar is our Composite AI approach, designed to combine multiple techniques so detection and diagnosis remain stable when data is messy, environments shift, and failure modes don’t neatly repeat. Composite AI matters in production because a single method rarely holds up across the full set of reliability scenarios, especially when AI workflows add tool calls, retrieval layers, and multi-step execution paths.
In parallel, InsightFinder was also granted a new patent tied to automated incident prevention. Patents don’t replace execution, but they do serve as external validation that our underlying methods are defensible and novel. For enterprise buyers, this kind of IP matters because it shows our differentiation isn’t just packaging—it’s rooted in technical work that is not easily replicated.
Our innovation tone for the year was simple: we’re betting on reliability outcomes that hold up in real systems, not only in controlled demos.
Flagship Product Releases: Tracing for Agentic Systems and the ARI Reliability Agent
In 2025, two product narratives stood out as the clearest signals of where modern reliability is going: distributed tracing built for agentic AI workflows, and the ARI reliability agent designed to help responders move from signals to grounded next steps.
Distributed Tracing for Agentic AI Workflows
Many teams learn painful lessons as AI features move into production: logs don’t explain LLM incidents. A single user request can fan out into retrieval, tool calls, internal services, and multiple model invocations. When latency spikes, cost surges, or output quality degrades, responders need to reconstruct the story of execution. Without traces, teams can only guess what’s wrong. And guessing is expensive.
Our distributed tracing investments in 2025 squarely positioned traces as a primary debugging primitive for AI systems. The point isn’t just to see spans; it’s to preserve causality across a multi-step, multi-hop workflow so teams can answer the questions that matter during incidents.
Responders need to know whether a slowdown came from retrieval, a tool call, a downstream service dependency, a model provider change, or a prompt-path shift. They also need to understand which step changed, which users were impacted, and how the failure propagated. Tracing becomes the connective tissue that turns fragmented telemetry into a timeline engineers can trust.
This is especially true for agent systems, where execution paths can vary per request. Traditional monitoring assumes the same call graph repeats. Agents break that assumption. Tracing is one of the few observability primitives that remains reliable even when your AI workflows are dynamic.
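To make that concrete, a trace for one agent request can be modeled as a tree of spans, each carrying a parent link so causality survives dynamic fan-out. The sketch below is not InsightFinder’s API; it is a minimal, illustrative stand-in (all names and numbers are hypothetical) showing how parent links let a responder recover which branch of a variable execution path dominated latency.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    """One step in an agent workflow; parent_id preserves causality."""
    span_id: str
    name: str
    duration_ms: float
    parent_id: Optional[str] = None

def slowest_root_to_leaf(spans: list[Span]) -> list[str]:
    """Return the chain of span names that dominates end-to-end latency."""
    children: dict[Optional[str], list[Span]] = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    def walk(span: Span) -> tuple[float, list[str]]:
        kids = children.get(span.span_id, [])
        if not kids:
            return span.duration_ms, [span.name]
        # Follow the costliest child chain; this is the latency-critical path.
        cost, path = max(walk(k) for k in kids)
        return span.duration_ms + cost, [span.name] + path

    root = children[None][0]  # the span with no parent is the request root
    return walk(root)[1]

# A single user request fans out into retrieval, a tool call, and a model call.
trace = [
    Span("1", "user_request", 5.0),
    Span("2", "retrieval", 40.0, parent_id="1"),
    Span("3", "tool_call", 10.0, parent_id="1"),
    Span("4", "model_call", 900.0, parent_id="3"),
]
print(slowest_root_to_leaf(trace))  # → ['user_request', 'tool_call', 'model_call']
```

Even when the call graph varies per request, the parent links alone are enough to answer “which step slowed this request down,” which is exactly what flat logs cannot do.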
ARI: The Operational Reliability Agent
ARI, the Autonomous Reliability Insights agent, is where we made the year’s biggest statement about operational direction.
ARI is an operational partner that works inside your reliability workflows. ARI’s goal isn’t to replace engineers or to turn incident response into a chatbot experience. Rather, ARI reduces the mechanical load that repeats in every incident: gathering evidence, summarizing what changed, mapping symptoms to likely causes, and identifying the safest investigative next step.
ARI stays grounded in system evidence. Reliability teams don’t need more plausible narratives. They need decisions anchored in telemetry, topology, and change context. ARI’s role is to compress time-to-understanding by connecting signals to the evidence that justifies a conclusion.
MCP as the Gateway Layer Under ARI
A key part of ARI’s story is interoperability. In 2025, we introduced our MCP server as a gateway between AI and observability. The important point is architectural: ARI is built on top of MCP, which makes it easier to integrate into existing toolchains and workflows rather than forcing teams to rebuild around a single vendor experience.
For enterprise buyers, that matters. Reliability organizations already have established incident management, on-call practices, and observability stacks. Tools that require a rip-and-replace approach tend to stall. ARI’s MCP foundation signals a bias toward fitting into the systems teams already run.
Dependency Graph Capabilities as ARI’s Operational Backbone
We also advanced our dynamic dependency graph capabilities (often described as a “service map”). But what matters more is what those dependencies enable when paired with a reliability agent.
Static diagrams go stale quickly. Modern architectures change too fast, and agentic systems add variability in execution paths. A living dependency graph becomes valuable because it supports what responders actually need in the moment: blast radius awareness, service-to-service context, and causal hypotheses that reflect real relationships rather than outdated documentation.
Instead of treating “service map” as a standalone feature, we leveraged dependency discovery as a backbone for ARI’s investigations. Even when the primary goal isn’t a visual map in a UI, the underlying graph can still drive smarter analysis: root cause isolation, targeted probing, and more reliable reasoning about what changed and where to look next.
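Blast-radius awareness, for example, reduces to reverse reachability over the dependency graph: given a degraded service, which services transitively depend on it? The sketch below uses an invented toy graph and a plain breadth-first search; it illustrates the idea only and is not InsightFinder’s discovery mechanism.

```python
from collections import deque

# Illustrative dependency edges: caller -> callees ("checkout" depends on "payments").
deps = {
    "checkout": ["payments", "inventory"],
    "payments": ["ledger"],
    "inventory": ["ledger"],
    "search": ["inventory"],
}

def blast_radius(degraded: str, deps: dict[str, list[str]]) -> set[str]:
    """Services whose behavior may change because `degraded` is unhealthy."""
    # Invert the edges (callee -> callers) so we can walk "who depends on me."
    callers: dict[str, list[str]] = {}
    for caller, callees in deps.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    impacted: set[str] = set()
    queue = deque([degraded])
    while queue:
        svc = queue.popleft()
        for caller in callers.get(svc, []):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted

print(sorted(blast_radius("ledger", deps)))
# → ['checkout', 'inventory', 'payments', 'search']
```

The same graph, kept current by automated discovery rather than hand-drawn diagrams, is what lets an investigation scope its probing to services that can actually be affected.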
That’s the combined narrative: tracing to preserve the execution story, ARI to turn that story into decisional guidance, MCP to integrate ARI into the existing stack, and a dynamic dependency graph so investigations reflect real system relationships.
Other Supporting Improvements
InsightFinder also expanded supporting capabilities across the year (drift detection, model fine-tuning, policy guardrails, etc.), but the most important takeaway is how those improvements reinforced our core themes: compress the path from detection to safe action to better learning.
In 2025, our platform direction showed a clear bias toward operational outcomes that reliability teams can trust during real incidents.
Customer Momentum
Product velocity only matters when it maps to real adoption. In 2025, our customer momentum became more concrete through expansions and production-scale deployments that reflect how reliability teams are operating today.
Lenovo saw a substantial 10X increase in usage over the year, signaling growing reliance on InsightFinder in day-to-day operations across 2 million devices.
Comcast NBCUniversal expanded its engagement, reflecting continued investment in reliability capabilities that work across complex enterprise environments.
AccessParks also expanded, serving as a clear indicator of platform fit as requirements scale and operational expectations rise. By doubling its deployment to cover nearly half a million devices, AccessParks demonstrated scaling demand rather than a static rollout. In a world where observability tools often plateau after initial deployment, expansion at that magnitude is a practical signal that the platform is delivering operational value to multiple teams.
UBS became a key customer, with a focus on data quality and data integrity for mission-critical trading data. Data quality failures are rarely loud at first. Instead, they quietly corrupt downstream analytics, trigger bad decisions, and create risk that surfaces later when it is harder to unwind. Data integrity is an operational reliability problem, not a periodic audit task, and it is increasingly acute in AI-adjacent systems where models amplify upstream errors.
We’re seeing customers expand their adoption footprint: starting with reliability across their traditional infrastructure, then moving toward AI reliability, all under the same platform. The combined signal across these customers reflects a broader market pattern: reliability programs are widening beyond classic infrastructure telemetry to include dynamic execution paths, tool-mediated behavior, and data quality as a first-class operational concern.
Team Growth: Investing in Go-To-Market
A platform that aims to move from insight to action needs more than engineering throughput. It needs stronger go-to-market execution, clearer product storytelling, and tighter feedback loops with customers who run high-stakes systems.
In 2025, we expanded our team with key hires across marketing, sales, and growth functions, including Theresa Potratz, Dan Braun, and George Miranda. InsightFinder can’t scale customer impact by shipping new features alone. We scale by building a GTM machine that serves the needs and operational realities of today’s enterprise environments.
Ecosystem Windfalls
There’s an unmistakable signal coming from the broader market: investors are pouring capital into AI reliability, AI observability, and AI for production operations. Over the past year, the space has seen repeated headlines around oversized venture rounds and eye-catching valuation marks—often for companies that are still early in customer adoption. The takeaway isn’t that the category is overhyped; it’s that the market is pricing in the inevitability of reliability pain as AI systems move from interesting demos to revenue-critical workflows. In other words, investors are treating AI reliability as a foundational layer of the modern stack, even if most vendors are still trying to prove their value in production.
That’s exactly where InsightFinder’s experience matters. We’re not arriving in a newly funded category with a point solution—we’re a seasoned reliability platform built for your entire stack, covering traditional deterministic applications and infrastructure as well as non-deterministic AI applications and agentic workflows. That dual coverage is increasingly non-negotiable in enterprise environments where classic microservices, IT services, and AI-driven experiences coexist on the same critical path. And because InsightFinder’s detection and diagnosis are powered by Composite AI techniques, we deliver a best-in-class experience for detecting, proactively preventing, mitigating, and explaining anomalies—including the subtle failure modes other tools miss when signals are noisy, workflows are dynamic, and “normal” shifts underneath you.
What 2025 Signals for Reliability Leaders
For 2025, the most important takeaway from InsightFinder’s strategy is an operational thesis.
Modern systems fail in ways that are harder to reason about. Agentic AI introduces variable execution paths. Distributed systems evolve faster than static documentation. Data changes quietly until it becomes a business incident. In that environment, reliability teams need tooling that preserves causal evidence, shortens investigations, and supports safer decisions under pressure.
Our developments in 2025 align tightly with that thesis: innovation that holds up in production, tracing designed for dynamic agent workflows, and ARI as a reliability agent built to integrate into existing stacks and reason over real dependency context. When you manage complex stacks that run traditional deterministic services alongside today’s non-deterministic AI systems, InsightFinder AI has your reliability needs covered.
See Our Platform Direction Against Your Own Reliability Needs
If your team runs distributed services, LLM applications, or agentic workflows in production, the question is no longer whether you have observability. The question is whether your tooling can keep up with probabilistic behavior, changing execution paths, and the operational need to turn signals into actions safely.
To see what InsightFinder’s 2025 capabilities look like against your own workloads, request a demo so we can walk through your highest-risk reliability scenarios.