Portfolio by Johnny Rice · Informatics demo
ClariTrial

Building an Audit Trail for AI: From Portfolio Demo to Regulated Research

Regulatory and compliance requirements demand reproducibility. Versioned prompts, structured audit events, and scope-tagged postures provide the foundation.

Tags: audit, compliance, regulated

When a human analyst produces a report, the organization keeps records: who wrote it, what data they accessed, when it was reviewed, and who approved it. When an AI agent produces the same report, most systems keep nothing. The prompt is ephemeral. The model version is unrecorded. The tool calls are invisible. If a regulator asks "how did the system reach this conclusion six months ago," the answer is often "we do not know."

This is the audit-trail problem, and it matters in any domain where decisions have consequences: healthcare, finance, legal, government. It also matters in a simpler sense for engineering teams that need to debug a regression: "the model used to give good answers to this question, and now it does not. What changed?"

What ClariTrial logs

Every chat completion in ClariTrial writes a structured event to a JSONL audit file. Each event includes:

  • Timestamp: when the response was generated.
  • Prompt version (PROMPT_VERSION): a monotonically increasing version number, bumped whenever system prompts or orchestration logic change materially. This is the single most important field for reproducibility.
  • Model ID: which model (provider and version) generated the response.
  • Step count: how many tool-call steps the lead model used.
  • Tool trace: the sequence of tool names and truncated input previews, including nested specialist tool calls.
  • User prompt fingerprint: a SHA-256 hash (truncated to 16 hex characters) of the user's input. The raw prompt is not stored in the audit log, preserving user privacy while enabling duplicate detection.
  • Product scope: whether the deployment is running in portfolio_demo or regulated_research posture.
  • Intent mode: what the user selected (Auto, Facts & SQL, Explore).

The file path defaults to .data/chat-audit.jsonl and is configurable via CHAT_AUDIT_LOG_PATH.
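As a concrete sketch, a completion event along these lines can be built and appended in a few lines of TypeScript. The AuditEvent shape, field names, and the lowercase encodings of the intent modes are illustrative reconstructions from the list above, not ClariTrial's actual schema:

```typescript
import { createHash } from "node:crypto";
import { appendFileSync } from "node:fs";

// Hypothetical event shape mirroring the fields described in the post.
interface AuditEvent {
  type: "completion";
  timestamp: string;
  prompt_version: number;
  model_id: string;
  step_count: number;
  tool_trace: { tool: string; input_preview: string }[];
  user_prompt_fingerprint: string; // SHA-256 of the raw prompt, truncated
  product_scope: "portfolio_demo" | "regulated_research";
  intent_mode: "auto" | "facts_sql" | "explore"; // assumed encodings
}

// Privacy-safe fingerprint: hash the prompt, keep only 16 hex characters.
// The raw prompt itself never reaches the log.
function fingerprint(prompt: string): string {
  return createHash("sha256").update(prompt).digest("hex").slice(0, 16);
}

function buildAuditEvent(args: {
  promptVersion: number;
  modelId: string;
  stepCount: number;
  toolTrace: { tool: string; input_preview: string }[];
  userPrompt: string;
  scope: AuditEvent["product_scope"];
  intent: AuditEvent["intent_mode"];
}): AuditEvent {
  return {
    type: "completion",
    timestamp: new Date().toISOString(),
    prompt_version: args.promptVersion,
    model_id: args.modelId,
    step_count: args.stepCount,
    tool_trace: args.toolTrace,
    user_prompt_fingerprint: fingerprint(args.userPrompt),
    product_scope: args.scope,
    intent_mode: args.intent,
  };
}

// Append one JSON line; path is configurable the way the post describes.
function appendAuditEvent(event: AuditEvent): void {
  const path = process.env.CHAT_AUDIT_LOG_PATH ?? ".data/chat-audit.jsonl";
  appendFileSync(path, JSON.stringify(event) + "\n");
}
```

The one-event-per-line JSONL format keeps writes append-only and crash-safe: a partially written last line can be discarded without corrupting earlier events.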

Per-message feedback

Users can attach thumbs-up or thumbs-down feedback to any assistant message. This feedback is appended to the same audit file as a separate event type (type: "feedback"), linked to the message ID and thread ID. Over time, this creates a quality signal that can be correlated with prompt versions, model IDs, and tool traces to identify regressions.
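A feedback event can be sketched the same way. The field names below are assumptions that mirror the description, not the real schema:

```typescript
// Hypothetical feedback event, appended to the same JSONL audit file
// as completion events but distinguished by its type field.
interface FeedbackEvent {
  type: "feedback";
  timestamp: string;
  message_id: string; // links the rating to a specific assistant message
  thread_id: string;  // and to the conversation it occurred in
  rating: "up" | "down";
}

function buildFeedbackEvent(
  messageId: string,
  threadId: string,
  rating: "up" | "down",
): FeedbackEvent {
  return {
    type: "feedback",
    timestamp: new Date().toISOString(),
    message_id: messageId,
    thread_id: threadId,
    rating,
  };
}
```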

The activity timeline

Authenticated users can view /chat/activity, a timeline that shows which tools and specialists the agent used in each conversation. This is a user-facing transparency feature, not a compliance artifact, but it serves a related purpose: it lets users review what the agent did across sessions, catch patterns (e.g., "the agent always delegates to trial_discovery for this type of question"), and report concerns.

Prompt versioning as a change-management tool

PROMPT_VERSION is the key to the audit trail. When a developer changes the system prompt, the orchestrator instructions, a specialist prompt, or the answer-typing rules, they bump the version number. The version is logged with every completion.

This means that if answer quality changes, the first debugging step is to filter the audit log by prompt version and compare. If the regression appeared at version N, the diff between version N-1 and N is the first place to look.
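Assuming completion events carry a message_id and prompt_version, and feedback events carry a message_id and a rating (an assumed join key, not confirmed by the post), a first-pass regression check over the JSONL log might compute the downvote rate per prompt version:

```typescript
// Join feedback events to completions on message_id and compute the
// fraction of rated responses that were thumbs-down, per prompt version.
// Event shapes here are illustrative assumptions.
function downvoteRateByVersion(jsonl: string): Map<number, number> {
  const events = jsonl
    .split("\n")
    .filter((line) => line.trim())
    .map((line) => JSON.parse(line));

  // First pass: remember which prompt version produced each message.
  const versionByMessage = new Map<string, number>();
  for (const e of events) {
    if (e.type === "completion") {
      versionByMessage.set(e.message_id, e.prompt_version);
    }
  }

  // Second pass: tally ratings against the version that produced them.
  const totals = new Map<number, { down: number; all: number }>();
  for (const e of events) {
    if (e.type !== "feedback") continue;
    const version = versionByMessage.get(e.message_id);
    if (version === undefined) continue; // feedback on an unlogged message
    const t = totals.get(version) ?? { down: 0, all: 0 };
    t.all += 1;
    if (e.rating === "down") t.down += 1;
    totals.set(version, t);
  }

  const rates = new Map<number, number>();
  for (const [version, t] of totals) rates.set(version, t.down / t.all);
  return rates;
}
```

A jump in the downvote rate between version N-1 and N points the investigation directly at that prompt diff.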

Without prompt versioning, debugging a regression in an agentic system is like debugging a code change without version control: you know something is different, but you do not know what.

Scope tagging: a bridge to regulation

ClariTrial runs in portfolio_demo scope by default. When ARCHITECTURE_PRODUCT_SCOPE=regulated_research is set, every audit event is tagged with the stricter posture. Today, this is a metadata label. The system does not yet enforce different behavior (no RBAC, no human-review gates, no verification steps between tool results and narrative).

But the label is the foundation. An organization adopting this pattern for regulated use would:

  1. Start by deploying in regulated_research scope so all events are tagged.
  2. Build dashboards over the JSONL audit log to monitor tool usage, model performance, and feedback trends.
  3. Add RBAC rules that restrict which tools are available to which user roles.
  4. Add a human-review queue for responses flagged by the draft-analysis heuristic.
  5. Add verification gates that require a second model or a human to confirm tool-result-to-narrative fidelity before the response is shown.

Each step builds on the audit infrastructure that already exists. The difference between a portfolio demo and a regulated deployment is not a rewrite; it is a series of enforcement layers added on top of the same event stream.
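As one example of such an additive layer, a scope-gated RBAC filter could sit in front of tool selection and read the same posture flag. The roles and most tool names below are hypothetical (only trial_discovery appears in this post), and this is a sketch of the pattern, not ClariTrial's implementation:

```typescript
type Scope = "portfolio_demo" | "regulated_research";
type Role = "analyst" | "reviewer" | "admin"; // assumed role set

// Assumed role-to-tool allowlist; tool names are illustrative.
const TOOL_ALLOWLIST: Record<Role, string[]> = {
  analyst: ["trial_discovery"],
  reviewer: ["trial_discovery", "draft_analysis"],
  admin: ["trial_discovery", "draft_analysis", "sql_query"],
};

// In regulated posture, restrict the tool set by role; in demo posture,
// pass everything through unchanged, matching today's behavior.
function allowedTools(all: string[], scope: Scope, role: Role): string[] {
  if (scope === "portfolio_demo") return all; // no enforcement in demo
  return all.filter((tool) => TOOL_ALLOWLIST[role].includes(tool));
}

// The posture would come from the flag the post names, e.g.:
// const scope = (process.env.ARCHITECTURE_PRODUCT_SCOPE ?? "portfolio_demo") as Scope;
```

Because the filter is a pure function over the existing scope label, it can be added, tested, and rolled back without touching the logging path.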

The template for AI compliance

The pattern is generalizable:

  1. Log structured events for every AI-generated output. Include the prompt version, model version, tool trace, and a privacy-safe input fingerprint.
  2. Version your prompts the way you version your code. Make the version number a first-class field in every log event.
  3. Collect user feedback alongside completion events so quality signals are correlated with system configuration.
  4. Tag deployment posture so the same codebase can run in demo and regulated modes, with the audit trail reflecting which mode was active.
  5. Plan enforcement as additive layers (RBAC, review queues, verification gates) that read from the audit stream, rather than as rewrites of the core system.

Most organizations building AI products today skip the audit trail because they are focused on getting the model to produce good answers. By the time regulation or compliance requirements arrive, they have months of unlogged outputs and no way to reproduce past behavior. Starting with the audit infrastructure, even in a demo deployment, means the path to compliance is incremental rather than catastrophic.