When 40% Better Becomes 23% Worse: Task Gains vs. System Reliability

AI improves individual tasks but introduces new failure modes at the system level. Multi-agent architectures must account for the gap.

multi-agent · reliability · evaluation

A BCG study with researchers from Harvard, MIT Sloan, Wharton, and Warwick found that consultants using GPT-4 for creative product innovation performed 40% better than those working without it. In the same study, they underperformed by 23% on tasks involving complex business problem solving. (The Byte: AI Has a Speed Problem)

A GitHub Copilot study reported that developers completed tasks 55.8% faster with AI assistance. But as Lauren Slyman notes in her analysis, the study measured task completion in isolation, without accounting for downstream work: code review, debugging, integration, long-term maintenance. Anthropic's own research found that one AI-assisted group scored 17% lower than those who coded without help, with the largest gap appearing in debugging knowledge.

The pattern across studies is consistent: AI improves task-level performance but introduces new failure modes at the system level. This is not a model limitation. It is an architecture problem.

The task-system gap in multi-agent AI

Multi-agent systems exhibit the same pattern. A specialist agent that retrieves clinical trial data from ClinicalTrials.gov is fast and accurate at the task level. It searches the API, parses the response, and returns structured results. If you evaluate the specialist in isolation, it performs well.

But the specialist operates inside a system. The lead model decides when to call it. The system prompt shapes what question it receives. The results pass through a synthesis layer that combines them with outputs from other specialists. Each of these handoff points is a potential failure mode that does not exist at the task level.

In ClariTrial, we have observed the following system-level failure modes that do not appear in task-level evaluation:

Misrouting. The lead model sends a competitive-landscape question to the trial-discovery specialist, which returns trial data that technically answers the query but misses the comparative framing the user expected. The specialist did its job. The system failed.

Synthesis drift. Two specialists return accurate data that, when combined by the lead model, produces a misleading comparison. Each specialist's output is correct. The synthesis is not.

Confidence-accuracy mismatch. A specialist returns results with high surface confidence (fluent prose, specific numbers) that are drawn from a narrow or stale data slice. The system presents it as a definitive answer when it should present it as partial.

Structure collapse. Under complex multi-source queries, the model drops the Facts/Summary/Interpretation structure and writes a single flowing narrative that blends verified data with inference. The individual data points are accurate. The presentation makes them unverifiable.

How architecture addresses the gap

The BCG finding, that AI helps ideation but hurts complex problem-solving, suggests that the value of AI depends on where in the workflow it is applied and what guardrails surround it. The researchers found that participants over-relied on the model's output for analytical tasks, accepting plausible-sounding answers without sufficient scrutiny.

This maps to a specific architectural requirement: the system must not trust its own components uncritically.

Reflection loops address the confidence-accuracy mismatch. After each specialist completes, heuristic checks evaluate relevance, specificity, consistency, and data presence. A confidence score below 0.5 triggers a retry with diagnostic context. This catches the case where a specialist returned a technically valid response that does not actually support the claim the lead model is assembling.
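A minimal sketch of such a reflection loop, assuming a `run_specialist(query)` callable and the simple surface heuristics named above (relevance, specificity, data presence); the scoring weights and the `run_with_reflection` helper are illustrative, not ClariTrial's actual implementation:

```python
import re

def heuristic_confidence(query: str, output: str) -> float:
    """Score a specialist response on simple surface heuristics (0.0-1.0)."""
    score = 0.0
    # Relevance: share of substantive query terms that appear in the output.
    terms = [t.lower() for t in query.split() if len(t) > 3]
    if terms:
        hits = sum(1 for t in terms if t in output.lower())
        score += 0.4 * (hits / len(terms))
    # Specificity: does the output contain any concrete numbers?
    if re.search(r"\d", output):
        score += 0.3
    # Data presence: non-trivial length suggests real content was returned.
    if len(output) > 200:
        score += 0.3
    return score

def run_with_reflection(query: str, run_specialist, max_retries: int = 1) -> str:
    """Retry a specialist call with diagnostic context when confidence < 0.5."""
    output = run_specialist(query)
    for _ in range(max_retries):
        conf = heuristic_confidence(query, output)
        if conf >= 0.5:
            break
        # Retry with diagnostic context appended to the original query.
        diagnostic = (f"{query}\n\n(Previous answer scored {conf:.2f}; "
                      "include specific figures and address the question directly.)")
        output = run_specialist(diagnostic)
    return output
```

Heuristics like these are cheap to run after every specialist call, which is what makes them usable as an always-on check rather than an occasional audit.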

Router evaluation addresses misrouting. The eval suite includes a 23-query golden dataset with expected routing labels. A confusion matrix tracks which query types are misrouted to which specialists. When a prompt change causes competitive-landscape queries to be classified as trial-discovery, the matrix catches it before any user encounters the error.
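The shape of that evaluation can be sketched as follows, assuming a `route(query)` function that returns a specialist label; the golden entries and label names here are illustrative stand-ins, not the actual 23-query dataset:

```python
from collections import defaultdict

# Hypothetical golden entries: (query, expected specialist label).
GOLDEN = [
    ("Which phase 3 NSCLC trials are recruiting?", "trial-discovery"),
    ("How does sponsor X's pipeline compare to sponsor Y's?", "competitive-landscape"),
    ("Summarize the endpoints for NCT01234567", "trial-detail"),
]

def evaluate_router(route, golden=GOLDEN):
    """Return routing accuracy and a confusion matrix {(expected, actual): count}."""
    confusion = defaultdict(int)
    correct = 0
    for query, expected in golden:
        actual = route(query)
        confusion[(expected, actual)] += 1
        correct += (actual == expected)
    return correct / len(golden), dict(confusion)
```

The confusion matrix is the useful part: a drop in aggregate accuracy tells you something broke, but the `(expected, actual)` cell tells you which query type is being sent to which wrong specialist.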

Structured debate addresses synthesis drift. For questions where multiple perspectives are likely to yield different conclusions, ClariTrial can run a debate protocol: three agents (optimistic, skeptical, balanced) argue positions through initial statements, a challenge round, and a judge synthesis. Confidence scores are aggregated by source tier and recency. The output is not a single narrative but a structured analysis that makes disagreements visible.
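The control flow of such a protocol can be sketched as below, assuming an `ask(role, prompt)` helper that calls a model with a role-specific system prompt; the prompts and the flat two-round structure are a simplification of what a production debate protocol would use:

```python
ROLES = ["optimistic", "skeptical", "balanced"]

def run_debate(question: str, ask) -> str:
    """Run a three-agent debate: initial statements, a challenge round,
    then a judge synthesis that keeps disagreements visible."""
    # Round 1: each agent states an initial position.
    statements = {r: ask(r, f"State your position on: {question}") for r in ROLES}
    # Round 2: each agent challenges the other agents' statements.
    rebuttals = {}
    for r in ROLES:
        others = "\n".join(s for k, s in statements.items() if k != r)
        rebuttals[r] = ask(r, f"Challenge these positions:\n{others}")
    # Judge synthesis over the full transcript.
    transcript = "\n".join(
        f"[{r}] {statements[r]}\n[{r} rebuttal] {rebuttals[r]}" for r in ROLES
    )
    return ask("judge", f"Synthesize, noting points of disagreement:\n{transcript}")
```

The point of the structure is that the judge sees explicit rebuttals, not just three parallel answers, so genuine disagreement surfaces instead of being averaged away.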

Post-response compliance checking addresses structure collapse. A regex-based scan verifies that tool-grounded responses include the required Facts/Summary/Interpretation headings. When the model skips them, a warning appears in the trace panel.
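A check of this kind can be as small as the sketch below, which assumes the required headings appear as markdown-style section titles; the exact patterns ClariTrial scans for may differ:

```python
import re

REQUIRED_HEADINGS = ["Facts", "Summary", "Interpretation"]

def check_structure(response: str) -> list[str]:
    """Return the required headings missing from a tool-grounded response."""
    missing = []
    for heading in REQUIRED_HEADINGS:
        # Match the heading at the start of a line, with an optional '##' prefix.
        if not re.search(rf"^(#+\s*)?{heading}\b", response, re.MULTILINE):
            missing.append(heading)
    return missing
```

A non-empty return value is what drives the trace-panel warning: the scan does not block the response, it just makes the collapse visible.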

The Copilot lesson: faster is not better

The Copilot study's finding, that task completion was 55.8% faster, is meaningful only if faster task completion leads to faster, higher-quality system output. Slyman's point is that it often does not: "Faster coding does not necessarily make for faster delivery."

The same applies to AI agents. A specialist that returns results in 2 seconds instead of 5 is valuable only if the quality of those results supports the system-level synthesis. If the faster specialist produces output that requires more reflection retries, more synthesis corrections, or more user verification, the net system performance may be worse.

This is why ClariTrial's observability layer tracks latency at both the specialist level and the end-to-end level. A specialist optimization that reduces specialist latency by 40% but increases end-to-end p95 latency (because the lead model needs more steps to reconcile the output) is a regression, not an improvement.
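The regression logic can be sketched as follows, with illustrative timing samples that reproduce the scenario above: specialist latency improves while end-to-end p95 worsens. The sample values and the `is_regression` rule are assumptions for the sketch, not measured numbers:

```python
import statistics

def p95(samples: list[float]) -> float:
    """95th-percentile latency from raw timing samples (seconds)."""
    return statistics.quantiles(sorted(samples), n=20)[-1]

# Illustrative timings: the specialist gets faster, but the lead model
# needs extra reconciliation steps, so end-to-end latency regresses.
before = {"specialist": [5.0] * 20, "end_to_end": [9.0] * 20}
after  = {"specialist": [3.0] * 20, "end_to_end": [11.0] * 20}

def is_regression(before: dict, after: dict) -> bool:
    """A change is a regression if end-to-end p95 got worse,
    regardless of any specialist-level gains."""
    return p95(after["end_to_end"]) > p95(before["end_to_end"])
```

The asymmetry is deliberate: end-to-end latency is the user-facing outcome, so it gets veto power over component-level improvements.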

Measurement at the system level

Slyman argues that "AI is accelerating outputs, but it is not yet accelerating outcomes." The distinction is crucial for evaluation.

Task-level metrics (specialist accuracy, response latency, tool-call success rate) measure outputs. System-level metrics (router accuracy, end-to-end latency, feedback rate, structure compliance rate, debate convergence) measure outcomes. An evaluation suite that only tracks the first set will show consistent improvement while system reliability degrades.

ClariTrial's eval suite operates at both levels. Code-based evaluators check specialist output quality. The router evaluator checks system-level routing accuracy. The LLM-as-judge evaluator scores end-to-end response quality with a weighted rubric. The convergence evaluator checks whether the system produces stable answers for similar queries.

The experiment runner allows before/after comparison across all evaluators, with regression detection. When a change improves specialist metrics but degrades system metrics, the comparison makes the tradeoff explicit.
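A minimal version of that comparison logic, assuming each run is summarized as a dict of metric scores where higher is better; the metric names and tolerance are illustrative:

```python
def compare_runs(before: dict, after: dict, tolerance: float = 0.01) -> dict:
    """Return per-metric deltas and flag regressions beyond a tolerance."""
    report = {}
    for metric in before:
        delta = after[metric] - before[metric]
        report[metric] = {
            "delta": round(delta, 4),
            "regression": delta < -tolerance,
        }
    return report

# A specialist-level gain paired with a system-level loss: exactly the
# tradeoff the comparison is meant to surface.
result = compare_runs(
    before={"specialist_accuracy": 0.82, "router_accuracy": 0.91},
    after={"specialist_accuracy": 0.88, "router_accuracy": 0.85},
)
```

In this example the specialist metric improves while router accuracy regresses, so the report flags the change rather than letting the task-level gain mask the system-level loss.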

The organizational readiness problem

Slyman's second practical takeaway is that "organizations must invest in readiness: data quality, governance, workflows, and system clarity." This applies to AI product teams as much as to the enterprises deploying AI products.

Building a multi-agent system without evaluation infrastructure is the engineering equivalent of deploying a microservices architecture without monitoring. Each service may work in isolation. The system-level behavior is unknowable until something breaks in production.

The readiness investment is in the measurement layer: golden datasets, router evaluations, quality monitors, experiment comparisons, structured feedback collection. It is boring work that does not appear in product demos. It is the infrastructure that determines whether the system improves over time or oscillates between good and bad without anyone understanding why.

The BCG study's 40%/23% split is a warning. Task-level gains are real. System-level reliability requires separate, dedicated engineering work. The organizations that understand this distinction will build AI systems that compound in value. The ones that do not will build systems that demo well and degrade in production.