
When AI Agents Hallucinate: Why Deterministic Queries Come First

LLMs fabricate data in critical domains. A deterministic-first architecture makes the agent prove its claims before it synthesizes.

Tags: hallucination, architecture, SQL

In January 2025 a hospital system discovered that its internal AI assistant had been citing clinical trial IDs that did not exist. The model had learned the format of NCT identifiers well enough to fabricate plausible ones, and nobody downstream caught it until a physician tried to look one up on ClinicalTrials.gov. The page returned a 404.

This is the hallucination problem, and in data-critical domains it is not academic. Fabricated trial counts, invented enrollment numbers, and phantom PubMed IDs erode trust in a way that is hard to recover from. Once a user catches one bad number, every future number is suspect.

Why it happens

Large language models are completion engines. They predict the next token, and when the most likely token sequence looks like a fact, they emit it regardless of whether it corresponds to reality. In creative writing this is a feature. In a system that returns clinical trial metadata it is a liability.

The standard mitigation is retrieval-augmented generation (RAG): give the model access to a search index and tell it to cite what it finds. RAG helps, but it does not eliminate the problem. The model can still interpolate between retrieved passages, invent statistics that "sound right" given the context, or silently drop the retrieval step when it is confident in its own knowledge.

Deterministic first, agentic second

ClariTrial takes a different approach. Instead of asking the model to retrieve and then narrate, the system routes every structured question through a deterministic query path before the agentic layer gets involved.

The architecture has two SQL access points:

  • Allowlisted presets (queryAactPreset): four curated queries (protein degrader, molecular glue, kinase inhibitor, combined recruiting) that return fixed schemas. The model picks which preset to call; it never writes SQL.
  • Validated parameterized queries (queryAactFlexible): the model specifies filters (target gene, sponsor, drug, condition, phase, modality) and the system builds safe parameterized SQL from them. The generated SQL is logged. No free-form model SQL reaches the database.
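A minimal sketch of those two access points, assuming a Postgres-style `$n` placeholder convention. The names, columns, and SQL here are illustrative, not ClariTrial's actual code:

```typescript
// Hypothetical sketch of the two deterministic SQL access points.
// Column names, preset SQL, and schemas are illustrative.

type PresetName =
  | "proteinDegrader"
  | "molecularGlue"
  | "kinaseInhibitor"
  | "combinedRecruiting";

// Allowlisted presets: the model selects a name; the SQL text is fixed.
const PRESETS: Record<PresetName, string> = {
  proteinDegrader: "SELECT nct_id, brief_title FROM studies WHERE modality = 'degrader'",
  molecularGlue: "SELECT nct_id, brief_title FROM studies WHERE modality = 'molecular_glue'",
  kinaseInhibitor: "SELECT nct_id, brief_title FROM studies WHERE modality = 'kinase_inhibitor'",
  combinedRecruiting: "SELECT nct_id, brief_title FROM studies WHERE status = 'RECRUITING'",
};

function queryAactPreset(name: PresetName): string {
  return PRESETS[name]; // no model-written SQL ever reaches this point
}

// Validated parameterized queries: filters map onto fixed columns,
// and values are bound as parameters, never string-interpolated.
interface Filters {
  targetGene?: string;
  sponsor?: string;
  phase?: string;
}

function buildFlexibleQuery(f: Filters): { sql: string; params: string[] } {
  const clauses: string[] = [];
  const params: string[] = [];
  const add = (column: string, value?: string) => {
    if (value !== undefined) {
      params.push(value);
      clauses.push(`${column} = $${params.length}`);
    }
  };
  add("target_gene", f.targetGene);
  add("sponsor", f.sponsor);
  add("phase", f.phase);
  const where = clauses.length ? ` WHERE ${clauses.join(" AND ")}` : "";
  return { sql: `SELECT nct_id, brief_title FROM studies${where}`, params };
}
```

The key property is that the model's degrees of freedom are filter *values*, not SQL *text*: even a malicious value ends up as a bound parameter, not as query syntax.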

Beyond SQL, the system has typed API clients for ClinicalTrials.gov, PubMed, OpenFDA, and WHO ICTRP. Each returns structured JSON with provenance metadata attached at the source level.
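The shape of such a response might look like the following. This is a hypothetical type, not the project's actual interface; the point is that provenance travels with the data rather than being reconstructed afterward:

```typescript
// Hypothetical result shape: every payload carries source-level provenance.
interface Provenance {
  source: "ClinicalTrials.gov" | "PubMed" | "OpenFDA" | "WHO ICTRP";
  retrievedAt: string; // ISO-8601 timestamp
  recordUrl: string;   // canonical URL a reviewer can open and verify
}

type ToolResult<T> =
  | { ok: true; data: T; provenance: Provenance }
  | { ok: false; error: string };

// Example payload a typed trial client might return.
interface TrialRecord {
  nctId: string;
  briefTitle: string;
}

function wrap<T>(data: T, provenance: Provenance): ToolResult<T> {
  return { ok: true, data, provenance };
}
```

Because the union is discriminated on `ok`, downstream code cannot touch `data` without first handling the failure branch, which is exactly the discipline the orchestrator needs.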

The agentic chat layer sits on top. A lead model can delegate to specialist subagents (trial discovery, deep dive, evidence synthesis, trial comparison), but every specialist is grounded in the same typed tool set. The system prompt explicitly forbids inventing NCT IDs, trial names, or results:

"Never invent NCT IDs or outcomes; specialists ground answers in tools."

When the AACT database is unreachable (missing credentials, network error), the tool returns ok: false and the orchestrator prompt instructs the lead to say so and fall back to ClinicalTrials.gov API search, not to guess.

What this means for other domains

The pattern is not specific to clinical trials. Any domain where structured databases already hold ground-truth data (financial transactions, supply chain records, legal filings, HR systems) can adopt the same layering:

  1. Preset queries for known, high-frequency question shapes.
  2. Validated parameterized queries for flexible but bounded access.
  3. Agentic synthesis only after deterministic results are in hand, with the model told to cite what the tools returned and label anything else as interpretation.
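The three layers above can be sketched as a single routing function. Every helper here is an illustrative stub (real systems would use an intent classifier or the model itself for layers 1 and 2):

```typescript
// Hypothetical three-layer router: deterministic paths first, synthesis last.
type Answer = { grounded: boolean; text: string };

// Layer 1: known, high-frequency question shapes map to fixed presets.
function matchPreset(question: string): string | null {
  return /recruiting/i.test(question) ? "combinedRecruiting" : null;
}

// Layer 2: bounded filters extracted for a validated parameterized query.
function extractFilters(question: string): { phase?: string } | null {
  const m = question.match(/phase\s*(\d)/i);
  return m ? { phase: `PHASE${m[1]}` } : null;
}

function answer(question: string): Answer {
  const preset = matchPreset(question);
  if (preset) return { grounded: true, text: `ran preset ${preset}` };
  const filters = extractFilters(question);
  if (filters) {
    return { grounded: true, text: `ran flexible query: ${JSON.stringify(filters)}` };
  }
  // Layer 3: nothing deterministic matched. Synthesis must label itself
  // as interpretation rather than presenting anything as data.
  return { grounded: false, text: "no deterministic result; interpretation only" };
}
```

The `grounded` flag is the structural guarantee: synthesis downstream can see whether it is narrating real rows or speculating, and label its output accordingly.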

The cost is that you need to enumerate your query surface. The benefit is that the most common and most dangerous failure mode of LLM agents (confidently presenting fabricated data as fact) is structurally prevented for the queries you have enumerated. The model can still hallucinate in its synthesis layer, but the raw data it works from is real.

The tradeoff

Deterministic-first is not zero-hallucination. The model can still misinterpret results, draw incorrect comparisons, or overstate implications. But the failure mode shifts from "the data is fake" to "the interpretation is wrong," which is a much easier problem for a human reviewer to catch, because the real data is right there in the same response, labeled as Facts.

That labeling is itself enforced: ClariTrial requires the model to structure replies with Facts, Summary, and Interpretation headings when it cites tool data, so a reader can tell at a glance which parts came from the database and which parts are the model's own reasoning.
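A structural check like that is cheap to enforce in code as well as in the prompt. This is a hypothetical validator sketch, assuming markdown-style headings; it is not ClariTrial's actual mechanism:

```typescript
// Hypothetical validator: reject replies that cite tool data but lack
// the Facts / Summary / Interpretation structure.
const REQUIRED_SECTIONS = ["Facts", "Summary", "Interpretation"] as const;

function hasRequiredSections(reply: string): boolean {
  // Each section must appear as a markdown heading on its own line.
  return REQUIRED_SECTIONS.every((h) =>
    new RegExp(`^#{1,6}\\s*${h}\\b`, "m").test(reply)
  );
}
```

A gate like this can run after generation and trigger a retry when the structure is missing, turning a prompt convention into an invariant the reader can rely on.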

Hallucination is not going to be solved by making models smarter. It is going to be solved by making systems that do not ask models to do things they are bad at. Looking up a row in a database is something a SQL query is good at. Summarizing what that row means for a researcher is something a language model is good at. The architecture should reflect that division.