The best bug reports were written by the suspect

In a project I worked on — the order back office for a group of e-commerce shops — a rule engine decides which orders ship automatically. Invoice payment is the risky path — you ship goods on credit and hope the invoice gets paid — so the engine is deliberately paranoid: credit bureau distress, overdue invoices linked to the same identity, suspicious address patterns, too-high order values. Anything that trips a rule gets held for a human to review.

The humans were drowning. Most held orders are fine — a thin but honest young company, a branch office shipping to a non-registered address, a regular customer who happens to have an invoice due today. A reviewer opens the order, pokes through credit data and history for a few minutes, and approves. Repeat fifty times a day.

So we added an LLM second opinion. Not an LLM that ships orders — an advisory verdict (approve/reject), a confidence score, and a written rationale, rendered in a panel on the order page next to the rule engine's hold reasons. The reviewer still decides. A few weeks into the pilot, the most surprising outcome wasn't the triage speedup. It was that when the model was "wrong", about half the time our data was wrong — and the model's written rationale is what exposed it.

The shape of the thing

The pipeline is deliberately boring:

  1. The rule engine holds an order for credit risk.
  2. A facade assembles a de-identified feature payload — more on that below — and sends it with a versioned system prompt to a reasoning model (GPT-5.5, reasoning_effort: medium).
  3. The structured verdict is stored in an assessments table, stamped with model_id and prompt_version.
  4. The back office shows the verdict, the rationale, and a feedback thread where reviewers can agree or disagree.

Two non-negotiable properties:

It can never break shipping. The assess call is a total function — every failure mode, including a completely missing prompt, becomes a clean status='error' row instead of an exception. The inline trigger that fires at hold-time is wrapped in catch (\Throwable) and capped per batch. If OpenAI is down, orders hold and ship exactly as before; nobody gets paged.

Only de-identified data leaves the building. No names, no addresses, no national identifiers, no full emails or phone numbers. Postcodes become three-digit prefixes, emails become domains, identity matching happens server-side in SQL and only counts and sums go into the payload: "this phone number links to 4 other accounts with a 0.8 cancellation rate", never the phone number itself. The one later exception — internal staff notes on the customer — goes through a PII screen first, and was a documented decision because a human-written "warning, suspected fraud, block this account" note turned out to outweigh almost every numeric signal.

Prompts are config, not code

The system prompt started as a file in the repo. That lasted about two weeks. Every reviewer disagreement that turned into a policy tweak ("a huge shared-IP cluster is office NAT, not fraud") meant a code deploy to change English prose.

So the prompt moved into a database table — one row per version, a single active flag, an admin page to publish a new version. Every stored verdict stamps the prompt version it ran under, so "the model rejected this" is always answerable with which model, which prompt, which payload version. An hourly reconciler re-assesses held orders whenever the active version changes, so a prompt fix propagates to the queue within the hour, deploy-free.

If you build anything like this, version-stamp everything from day one. The triple (model, prompt_version, payload_version) is the only thing that makes feedback actionable later — "the model is too harsh on overdue invoices" is useless if you can't tell which prompt era the complaint is about.

Evals, and the fixtures that lie about the date

Every prompt change runs against a regression corpus before publishing: a few dozen real (PII-stripped) held orders with adjudicated expected verdicts, plus citation checks — a rejection for overdue debt must actually mention the overdue debt, whether phrased in English or Swedish.

The corpus policy that took longest to learn: eval inputs must track what production actually generates. When the payload format changes, we re-capture every case against the live assembler rather than freezing inputs at some historical version — a prompt scored on payloads production will never send is a comforting lie. If a re-captured case flips its verdict, that's investigated, not silently relabeled: an additive payload change flipping a grounded verdict is a prompt bug.

Except. Some cases exist precisely to pin time-relative state. An invoice that was "due today" at hold time is overdue a week later; re-capture quietly destroys the very thing the case tests. Those cases are marked frozen — the recapture tool refuses to touch them — and their payloads are truthful snapshots of hold-time state. When the payload format gains a field, frozen cases get it hand-added with time-stable values. A small lie about the date in service of a true test.

The model as adversarial QA

Here's the part I didn't expect. The reviewer-disagreement loop — humans flagging verdicts they think are wrong — kept resolving not into prompt fixes, but into bug discoveries in our own feature pipeline:

The off-by-one that called paying customers deadbeats. Reviewers flagged a cluster of rejections citing an overdue invoice where they saw none. The engine flags an invoice overdue when due_date < NOW() — but due dates are stored at midnight, so an invoice due today reads as overdue from 00:00 on its own due date. Hundreds of invoices sit in that state at any moment. The rule engine had this bug for years; nobody noticed, because the rule engine doesn't write a paragraph explaining itself. The LLM does, a reviewer read the paragraph, and the date handling in the model's feature pipeline was corrected within days — the rule engine's own fix is a bigger change and went into the queue.

Phantom credit exposure. Prepayment-required orders — where goods only ship after payment — were counted as outstanding credit exposure by an aggregation written long ago. An unpaid, unshipped prepayment order is by construction zero credit risk: no goods left the warehouse. The model kept citing alarming outstanding balances that, on inspection, were these phantoms.

The field that was always zero. The payload had carried an order.bonus field (loyalty-credit redemption) from early on. It read the wrong totals row class, so it had been silently 0 on every order ever assessed. Found only when reviewers flagged false rejects on negative-margin orders — the "loss" was customers spending earned loyalty credit, and the field that would have shown that was dead. Nobody audits a feature that never fires.

The loyalty loophole. Fixing that surfaced something better: loyalty credit became spendable when the earning order was invoiced — not when it was paid. So you could place an order on invoice, never pay it, and spend the credit it "earned" on the next one. A meaningful pile of spendable credit was backed by unpaid invoices. That's not a data bug; that's a business-rule hole that had been quietly farmable for years, found because a prompt iteration forced us to ask what the bonus number actually meant.

The pattern generalizes. A rule engine consumes features silently — garbage in, hold out, no one is the wiser. An LLM that must write a human-readable justification, reviewed by domain experts with a disagree button, is a continuous audit of the feature pipeline. The rationale text is the debugging surface.

Encoding judgment, not rules

The other lesson: the prompt converged into something closer to a policy document than a prompt. The interesting iterations were never "think step by step" — they were domain judgment a senior reviewer carries in their head:

  • A cluster of accounts sharing an IP is office NAT. Sharing a phone number across many organizations with a high cancellation rate is identity churn. The discriminator is the rate, not the cluster size.
  • A rock-bottom credit score with no payment remarks is a thin, young company, not a bust-out. The trigger is the remarks, not the score.
  • Card and other prepaid payment methods extend no credit — so credit-bureau distress is irrelevant there, and the only thing left to fear is a stolen card, which is a different fraud shape entirely.
  • A negative-margin order is a pricing problem, not fraud, and no amount of customer credibility overrides it — separate axes, separate verdicts.

Each of those started as a reviewer disagreement, got debated, and ended as two sentences in the prompt plus a frozen eval case to keep it from regressing. That loop — disagree, adjudicate, encode, pin — is the actual product. The model is just the part that scales it.

Takeaways

  • Advisory-only is a feature, not a cop-out. Because the model can't ship anything, we could put it in production early and let real reviewer disagreement drive iteration, instead of guessing in a sandbox.
  • Stamp (model, prompt_version, payload_version) on every verdict. It's the difference between feedback and noise.
  • Put the prompt in the database. Policy prose shouldn't need a deploy.
  • Track your eval inputs to production, and freeze the time-relative ones. Both halves matter.
  • The rationale is an audit log for your features. If an LLM keeps citing a number that humans say is wrong, believe the humans — then go find out why your pipeline produced it. Our best bug reports of the year were written by the suspect.