The AI handed me a number that was wrong in the most convincing way

An automated tracker booked the pre-discount figure as revenue. Not garbage data, plausible data, which is why it survived. On why judgement, not speed, is the real leverage.

My bookkeeper caught something in a revenue overview I'd shared with her. A couple of invoices were showing more income than I'd actually billed. Not a rounding wobble — the figure was inflated by a real amount, and it had been sitting in my numbers looking perfectly legitimate.

The cause was small and almost elegant in how wrong it was. My revenue tracker pulls invoice data automatically from my time-tracking tool. When an invoice has a discount, that tool exposes several money fields: the subtotal before the discount, the discount amount, the tax, and the final total. My script had grabbed the subtotal, the figure before the discount, and recorded it as revenue. So every discounted invoice quietly reported what I could have charged rather than what I actually did.
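
To make the mix-up concrete, here is a minimal Python sketch. The field names (subtotal, discount, tax, total), the 21% tax rate and the amounts are assumptions for illustration; they are not the actual export of my time-tracking tool.

```python
from decimal import Decimal

# Hypothetical invoice record: field names and amounts are invented for illustration.
invoice = {
    "subtotal": Decimal("1000.00"),  # price before the discount
    "discount": Decimal("100.00"),   # discount granted on this invoice
    "tax": Decimal("189.00"),        # assumed 21% tax on the discounted amount
    "total": Decimal("1089.00"),     # what the client pays, tax included
}

def revenue_buggy(inv):
    # What the script effectively did: book the pre-discount subtotal.
    return inv["subtotal"]

def revenue_invoiced(inv):
    # What was actually billed, excluding tax: subtotal minus discount.
    return inv["subtotal"] - inv["discount"]

# The overstatement is exactly the discount, which is why it looks plausible.
overstatement = revenue_buggy(invoice) - revenue_invoiced(invoice)
print(revenue_buggy(invoice), revenue_invoiced(invoice), overstatement)
```

Both functions return a real value from the right invoice. Only one of them returns what was invoiced.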

It was wrong in the most convincing way possible

This is the part worth sitting with. The number wasn't garbage. It was a real value, pulled from the right invoice, formatted correctly, landing in the right column. It was off by exactly the discount — the one slice of the total a glance would never question. If it had been wildly wrong, I'd have spotted it. Because it was plausibly wrong, it survived.

That's the failure mode of automated systems and AI tools generally, and there's a growing body of research on it. The Chain-of-Verification work on large language models, listed in the sources below, describes exactly this: when these systems are wrong, they don't produce obvious nonsense, they produce a "plausible-looking alternative" that's hard to catch precisely because the rest of the output is fine. A single wrong figure hides comfortably inside a page of correct ones.

The dangerous error isn't the one that looks wrong. It's the one that looks exactly right.

And it's not a cheap problem at scale. Gartner's much-cited estimate is that poor data quality costs organisations an average of $12.9 million a year. Most of that isn't dramatic corruption. It's thousands of small, confident, slightly-off numbers feeding decisions nobody thought to re-check.

Two things made the difference, and neither was the software

The first was a human who knew what revenue means. To my bookkeeper, "you can only book what you actually invoiced" isn't a clever insight — it's the floor. She didn't need to see the code. She read the output against twenty years of knowing how the number is supposed to behave, and the discounted rows simply looked wrong to her. That's domain expertise doing the one thing automation can't: holding the result up against what reality requires.

The second was fixing the system, not just the cell. I corrected the two affected rows by hand, but the real repair was teaching the script to compute revenue as subtotal minus discount, and writing the rule — "record only what was actually invoiced" — into the tool's instructions so it can't drift back. A spreadsheet you patch is a chore. A rule you encode is a fix.
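
As a rough sketch of what encoding the rule can look like, the function below computes revenue as subtotal minus discount and refuses to return a figure when the invoice's own fields don't reconcile. The field names and the assumption that total equals subtotal minus discount plus tax are mine for illustration; a real export may be structured differently.

```python
from decimal import Decimal

def booked_revenue(inv):
    """Return net invoiced revenue: subtotal minus discount, tax excluded.

    Assumes hypothetical fields 'subtotal', 'discount', 'tax' and 'total',
    with total == subtotal - discount + tax.
    """
    subtotal = Decimal(inv["subtotal"])
    discount = Decimal(inv.get("discount", "0"))
    tax = Decimal(inv.get("tax", "0"))
    total = Decimal(inv["total"])

    net = subtotal - discount

    # Cross-check against the invoice's own total so a field mix-up
    # fails loudly instead of landing quietly in the revenue column.
    if net + tax != total:
        raise ValueError(
            f"invoice fields do not reconcile: {net} + {tax} != {total}"
        )
    return net

example = {"subtotal": "1000.00", "discount": "100.00",
           "tax": "189.00", "total": "1089.00"}
print(booked_revenue(example))
```

The reconciliation check is the design choice that matters here: it stops a mismatched number from being booked in silence and puts it in front of a person to look at.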

The leverage is in the checking, not the speed

I'm all in on letting AI and automation do the mechanical work — I've written about handing it two years of receipts, and the leverage there was never the speed. It was that I knew what the output was supposed to look like, so I could catch it when it drifted. This invoice bug is the same story from the other side: the automation will confidently hand you a number, and someone has to own whether that number is true.

It's the quieter cousin of the dashboard that lies by omission. There, the report leaves something out. Here, the report includes something it shouldn't. Both look complete. Both pass the glance test. Both need a person who knows the territory to say, "that's not right," and to know why — which is the whole argument for deep domain expertise being the real leverage rather than the tool.

The tool will give you an answer in milliseconds. Whether it's the right answer is still, stubbornly, your job.

Sources & further reading

External
Gartner — Data quality: why it matters and how to achieve it
arXiv — Chain-of-Verification reduces hallucination in large language models

Related posts
I let AI clear two years of receipts. The leverage wasn't speed.
Your analytics dashboard is lying to you by leaving things out
The death of generic AI: why deep domain expertise is the only real leverage left

Subscribe to Remco Livain
