Operational ROI for AI: Moving Past the Hype Metric

Venture-style ROI does not translate to enterprise AI. Measuring latency, error rates, decision cycles, and system resilience builds a case that survives scrutiny.

Tags: ROI, evaluation, observability

Lauren Slyman's recent essay on AI's speed problem includes a provocation that should be pinned to every enterprise AI team's wall: "When ROI is defined through a VC lens, it becomes susceptible to incentives that, when applied to enterprises, are nonsensical and costly." (The Byte: AI Has a Speed Problem)

The venture definition of ROI assumes explosive growth or revenue acceleration based on a power-law distribution of returns. Most enterprise AI does not work that way. The returns are incremental: fewer manual handoffs, lower error rates, faster decision cycles, improved system resilience. Slyman notes these can appear in the "low single digits" and cannot be measured in isolation. They compound, but only if you build the infrastructure to measure them.

The measurement gap

Most AI products launch without measurement infrastructure. The team builds the model pipeline, ships the UI, and declares success based on user engagement or qualitative feedback. When leadership asks for ROI, the answer is either a vanity metric ("we handle 10,000 queries a month") or a proxy ("user satisfaction is 4.2 out of 5").

Neither metric tells you whether the system is getting better or worse. Neither helps you diagnose a regression. Neither survives a CFO asking "what did this save us in the last quarter?"

The problem is not that AI lacks ROI. It is that the ROI is operational, and most teams do not have the instrumentation to capture it.

What operational ROI looks like

Slyman's framework suggests measuring four dimensions: latency reduction, error prevention, improved decision cycles, and system resilience. Here is how each of these maps to concrete instrumentation in a system like ClariTrial.

Latency reduction

Every chat request and specialist run in ClariTrial creates a structured trace with parent-child spans recording duration, model, and status. The observability layer stores these in a ring buffer and exports them to a JSONL file and optionally to Arize Phoenix.

getAggregateMetrics() computes p50 and p95 latency across all traces in a time window. The quality monitoring API at /api/evals/quality returns these metrics alongside prompt version and model breakdowns.
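The post names getAggregateMetrics() as the aggregation entry point; the sketch below shows one way a p50/p95 computation over trace spans can look. The Span shape and function names here are illustrative assumptions, not ClariTrial's actual schema:

```typescript
// Minimal trace-span shape (illustrative, not the real schema).
interface Span {
  durationMs: number;
  model: string;
  status: "ok" | "error";
  startedAt: number; // epoch ms
}

// Nearest-rank percentile over a sorted copy of the values.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

// Aggregate p50/p95 latency for all spans inside a time window.
function aggregateLatency(spans: Span[], windowStart: number, windowEnd: number) {
  const durations = spans
    .filter((s) => s.startedAt >= windowStart && s.startedAt < windowEnd)
    .map((s) => s.durationMs);
  return {
    count: durations.length,
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
  };
}
```

Nearest-rank is the simplest percentile definition; an interpolating variant would serve equally well, as long as the same definition is used consistently across runs being compared.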

When a prompt change or model swap increases p95 latency from 4 seconds to 7 seconds, the system surfaces it. Without this instrumentation, the regression shows up as vague user complaints weeks later.

Error prevention

The eval suite runs 23 labeled queries through router accuracy evaluation (with confusion matrix), code-based validators for specialist output and SQL safety, and optionally an LLM-as-judge with a weighted rubric. The experiment runner saves results as timestamped JSON, supports comparison between runs, and detects regressions.
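Run-to-run comparison with regression detection can be sketched as follows; the RunResult shape, metric names, and tolerance are assumptions for illustration, not ClariTrial's actual format:

```typescript
// Shape of one saved experiment run (hypothetical schema).
interface RunResult {
  timestamp: string;
  metrics: Record<string, number>; // e.g. routerAccuracy, p95LatencyMs
}

// Metrics where higher is better; anything else is treated as
// lower-is-better (e.g. latency).
const HIGHER_IS_BETTER = new Set(["routerAccuracy", "positiveFeedbackRate"]);

// Compare a candidate run against a baseline and list metrics that
// moved in the wrong direction by more than a relative tolerance.
function detectRegressions(
  baseline: RunResult,
  candidate: RunResult,
  tolerance = 0.02
): string[] {
  const regressions: string[] = [];
  for (const [name, base] of Object.entries(baseline.metrics)) {
    const cand = candidate.metrics[name];
    if (cand === undefined || base === 0) continue;
    const delta = (cand - base) / Math.abs(base);
    const regressed = HIGHER_IS_BETTER.has(name)
      ? delta < -tolerance
      : delta > tolerance;
    if (regressed) regressions.push(`${name}: ${base} -> ${cand}`);
  }
  return regressions;
}
```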

This is not a "test suite" in the traditional sense. It is a quality regression detector. When a prompt version change causes the router to misclassify competitive-landscape queries as trial-discovery queries, the confusion matrix shows the shift before any user encounters it.
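The confusion matrix that surfaces such a shift takes only a few lines to build; the RoutedQuery shape below is an assumption for illustration:

```typescript
// One labeled golden query: expected route vs. the router's prediction.
interface RoutedQuery {
  expected: string;
  predicted: string;
}

// Build a confusion matrix (expected route -> predicted route -> count)
// and overall router accuracy from labeled results.
function routerConfusion(results: RoutedQuery[]) {
  const matrix: Record<string, Record<string, number>> = {};
  let correct = 0;
  for (const r of results) {
    matrix[r.expected] ??= {};
    matrix[r.expected][r.predicted] = (matrix[r.expected][r.predicted] ?? 0) + 1;
    if (r.expected === r.predicted) correct++;
  }
  return { matrix, accuracy: results.length ? correct / results.length : 0 };
}
```

A misrouted competitive-landscape query shows up as a nonzero off-diagonal count at matrix["competitive-landscape"]["trial-discovery"], before any user encounters it.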

Decision cycles

The step-count distribution (p50 and p95) tracks how many tool calls the agent uses per query. A rising step count means the agent is working harder to reach answers, which often indicates a prompt regression or a tool-availability issue. A declining step count after a prompt improvement confirms the change had the intended effect.

User feedback (thumbs up/down per message) is correlated with prompt version and model ID in the audit log. A prompt version that drops the positive-feedback rate by 5% is a regression, even if the model's text quality looks fine to the developer.
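A sketch of that correlation, assuming a minimal audit-log event shape and reading the 5% threshold as percentage points (both are assumptions, not ClariTrial's actual schema):

```typescript
// One feedback event from the audit log (illustrative fields).
interface FeedbackEvent {
  promptVersion: string;
  modelId: string;
  positive: boolean;
}

// Positive-feedback rate grouped by prompt version.
function feedbackRateByPromptVersion(events: FeedbackEvent[]): Record<string, number> {
  const totals: Record<string, { pos: number; all: number }> = {};
  for (const e of events) {
    const t = (totals[e.promptVersion] ??= { pos: 0, all: 0 });
    t.all++;
    if (e.positive) t.pos++;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([v, t]) => [v, t.pos / t.all])
  );
}

// Flag a regression when the new version's rate drops by 5 points or more.
function isFeedbackRegression(
  rates: Record<string, number>,
  oldVersion: string,
  newVersion: string
): boolean {
  return rates[newVersion] <= rates[oldVersion] - 0.05;
}
```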

System resilience

Source health monitoring tracks success/failure/rate-limit state per external source (ClinicalTrials.gov, AACT, PubMed, OpenFDA, WHO ICTRP). After three consecutive failures in one hour, a source is marked "down" and the status is injected into the system prompt so the agent can acknowledge the gap rather than silently omit results.

This is a resilience metric: how often are data sources degraded, and how does the system behave when they are? A system that silently drops a data source is less valuable than one that tells the user "PubMed is currently unreachable; results may be incomplete."
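The three-failures-in-one-hour rule described above can be sketched as a small state tracker; the class and method names are illustrative, not ClariTrial's actual implementation:

```typescript
type SourceStatus = "healthy" | "down";

// Tracks consecutive failures per external source. Three failures
// inside a one-hour window (with no intervening success) mark the
// source "down"; a success resets it.
class SourceHealth {
  private failures = new Map<string, number[]>(); // source -> failure timestamps (ms)
  private status = new Map<string, SourceStatus>();

  recordSuccess(source: string): void {
    this.failures.set(source, []);
    this.status.set(source, "healthy");
  }

  recordFailure(source: string, now: number, windowMs = 60 * 60 * 1000): void {
    const recent = (this.failures.get(source) ?? []).filter((t) => now - t < windowMs);
    recent.push(now);
    this.failures.set(source, recent);
    this.status.set(source, recent.length >= 3 ? "down" : "healthy");
  }

  // Status line suitable for injection into the system prompt.
  statusNote(source: string): string {
    return this.status.get(source) === "down"
      ? `${source} is currently unreachable; results may be incomplete.`
      : `${source} is healthy.`;
  }
}
```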

Compounding vs. moonshot

Slyman makes a distinction that maps precisely to evaluation strategy: "Organizations that chase hype metrics will jeopardize the compounding value that comes from patient and, often, boring process work."

An evaluation suite is boring process work. Writing 23 golden queries and labeling their expected routes is not exciting. Building a confusion matrix for router accuracy is not a feature anyone requests in a roadmap meeting. Setting up quality alerts that fire when the positive-feedback rate drops below a threshold is infrastructure that delivers value only when something goes wrong.

But this is where operational ROI lives. Each eval run that catches a regression before it reaches users is an error prevented. Each quality alert that surfaces a latency spike before users notice is a decision cycle shortened. Each experiment comparison that confirms a prompt improvement is a confidence signal for the next change.

The compound effect is that the system gets better in measurable, demonstrable increments. Not 10x better. Not "transformational." Just steadily, provably better, in ways that a CFO can read in a dashboard.

The bus stop ad problem

Slyman notes that the technologies that survive are the ones that can outlast their pitch. In AI product development, the pitch is the model demo: a fluent answer, a fast response, an impressive synthesis. The substance is the measurement layer underneath.

When an organization can show that its AI system reduced average research-question turnaround from 45 minutes to 12 minutes, with a 94% router-accuracy rate, a p95 latency of 5.2 seconds, and a positive-feedback rate of 78% (up from 71% after the last prompt iteration), that is not a bus stop ad. That is an operational case that justifies continued investment.

Building the infrastructure to produce those numbers is the work. Most of it is unglamorous. All of it compounds.