When to Double-Check Your AI: A Calibration Card

I asked three frontier chats for “three independent bookshops in Chicago open this weekend.” All three searched the web before answering, and all three picked real shops. ChatGPT and Claude both picked Sandmeyer’s in Printer’s Row, and both gave Sunday hours that matched the shop’s own website. On Saturday, Claude was off by an hour: it said Sandmeyer’s opens at 9am; the shop’s own site says 10am. A reader showing up at 9am Saturday based on Claude finds the door still locked. ChatGPT got it right both days. Gemini picked different shops and got their hours right too.

That’s the shape of getting things wrong from AI in May 2026. The big strokes are mostly right. The specifics are where failures land. And the cost of being wrong depends on what you’re going to do with the answer.

The card below pairs eleven common tasks with how much to verify, so you know where to put your attention before you read what the chat hands back.

The card

Task	Glance	Spot-check	Verify every claim
Math you can check on a calculator	Top models nail basic arithmetic and grade-school-style word problems. A quick re-read or calculator catches the rare miss.
Summarizing a document you pasted	Grounded summarization is the most reliable common task. Skim the summary against the doc; that's the verification.
Translating between common languages	Common pairs (English ↔ Spanish, French, German, Italian, Portuguese, Chinese, Japanese) are publication-quality. Idioms and proper names are the rare miss.
Generating creative content (story, ad, poem)	The creativity is the deliverable. Verify the facts inside the fiction (a story set in 1920s Berlin still needs the period details checked).
Recommendations (books, restaurants, products)		Spot-check the names against Google Maps. Hours, current open status, and the occasional invented shop are where the errors land.
Coding small snippets		Run it. "Tests pass" doesn't mean the code is good; read the logic before you copy-paste into anything that matters.
Step-by-step instructions for physical tasks (recipes, repairs)		Reliable for everyday cases. The failure mode is the dangerous specific (food-safe temperatures, chemical combos, foraging IDs, ladder loads).
Dates, prices, and statistics			Models confidently round, mis-date, and freshen stale numbers. Treat any specific figure as a starting point for a search.
Quotes and citations			>60% of citations fail one or more checks across major products. Look up the quote and the cite before you forward, post, or paste into a brief.
Recall of obscure or specialist facts			Models score near zero on hard recall benchmarks. Confident-sounding answers about niche subjects are the most likely to be fabricated.
Medical, legal, or financial questions			The general information is often very good; the specifics (dosages, deadlines, dollar figures, current-as-of dates) are where models slip. Verify the specifics.

Save or print this card for quick reference.

What sets the columns

Two things, mostly. The columns track how grounded the model’s answer is, and how much it costs you to be wrong.

Grounding does most of the work on accuracy. Vectara’s hallucination leaderboard puts top models between roughly 2% and 11% on grounded summarization, where the document is right there in the chat. Hand the same models 6,000 hard professional questions with no source provided (the Artificial Analysis AA-Omniscience benchmark did exactly that across 27 frontier models), and only three scored above zero on its reliability index. Same models, same question shapes; what changes is whether you handed them a source.

Cost-when-wrong does the rest. A wrong number that lands in a spreadsheet matters more than a wrong adjective in a story draft. That’s why “medical, legal, or financial” sits in “verify every claim” even though current models are genuinely good at general explanation in those domains. The information quality is high; the specifics (dosages, deadlines, citations, dollar figures) are where models slip, and the cost lands hard.
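One way to picture the card is as a lookup plus a stakes adjustment. The sketch below is illustrative only: `CARD` copies the rows of the card, while the names `Level` and `level_for`, and the one-column bump for high-stakes use, are assumptions of this sketch and not a rule the card states.

```python
from enum import IntEnum


class Level(IntEnum):
    """Verification columns, ordered from least to most effort."""
    GLANCE = 0
    SPOT_CHECK = 1
    VERIFY_EVERY_CLAIM = 2


# Rows copied from the card. Keys are lowercased task names with the
# parenthetical examples dropped (a shortening made for this sketch).
CARD = {
    "math you can check on a calculator": Level.GLANCE,
    "summarizing a document you pasted": Level.GLANCE,
    "translating between common languages": Level.GLANCE,
    "generating creative content": Level.GLANCE,
    "recommendations": Level.SPOT_CHECK,
    "coding small snippets": Level.SPOT_CHECK,
    "step-by-step instructions for physical tasks": Level.SPOT_CHECK,
    "dates, prices, and statistics": Level.VERIFY_EVERY_CLAIM,
    "quotes and citations": Level.VERIFY_EVERY_CLAIM,
    "recall of obscure or specialist facts": Level.VERIFY_EVERY_CLAIM,
    "medical, legal, or financial questions": Level.VERIFY_EVERY_CLAIM,
}


def level_for(task: str, high_stakes: bool = False) -> Level:
    """Look up a task's column; bump it one column when the stakes are high.

    The one-column bump is a hypothetical rule chosen for illustration.
    """
    base = CARD[task.strip().lower()]
    if high_stakes:
        return Level(min(base + 1, Level.VERIFY_EVERY_CLAIM))
    return base


print(level_for("Recommendations").name)                    # SPOT_CHECK
print(level_for("Recommendations", high_stakes=True).name)  # VERIFY_EVERY_CLAIM
```

The point of the bump is the second factor: the same task can deserve more checking when a wrong answer costs more.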

One small thing worth knowing about today’s chats: ChatGPT, Claude, and Gemini all search the web by default for questions that look fact-current. That changes the failure mode without removing it: even with search on, the specifics (hours, line numbers, prices) are where models slip. That’s why “Recommendations” sits at spot-check and not glance.

Two cells, demonstrated

A spot-check cell in action: the Sandmeyer’s hours. Same prompt I opened with, three chats, same morning. All three picked real, currently-open Chicago bookshops. ChatGPT and Claude both picked Sandmeyer’s in Printer’s Row; Gemini went with Exile in Bookville and The Book Cellar instead. ChatGPT confidently reported Sandmeyer’s hours as “Open Saturday and Sunday 10am–5pm,” which matches the shop’s own website. Claude reported “Open Saturday 9-5 and Sunday 10-5.” Same Sunday, but Saturday’s open-time was off by an hour. The reader who shows up at 9am Saturday based on Claude finds a locked door for the next hour; the reader using ChatGPT walks right in. That’s the spot-check moment in miniature: the picks were right, most of the hours were right, and one specific drifted by an amount that mattered.

I also ran the prompt with “without searching the web” prepended, fishing for a fabricated bookshop. None of the three fabricated one, but they handled the constraint three different ways. ChatGPT overrode it and searched anyway. Claude declined the task and offered to search instead: “I can’t reliably recommend three Chicago bookshops without searching, because store status changes (closures, relocations, hour changes) and I’d risk sending you somewhere that’s shut down or no longer exists.” Gemini honored the constraint, recommended three real long-standing shops, and added unprompted that since it wasn’t searching, “it is always a good idea to double-check their exact weekend hours…” Three flavors of calibration on a single sentence of instruction.

A verify-every-claim cell in action: a Hamlet quote about doubt. I asked each chat for “a quote from Hamlet about doubt, with the act, scene, and line numbers.” The Tow Center’s March 2025 study of eight AI search products found citation failure rates above 60%, which is why this row sits in the rightmost column. My run came out cleaner than that headline.

Claude and Gemini both delivered the canonical “Doubt thou the stars are fire” passage from the letter Polonius reads aloud in Act 2, Scene 2. Both cited it at lines 116–119. The Folger edition prints it at 115–118. Claude flagged that variance unprompted: “Note that line numbers can vary slightly between editions (Folger, Arden, Riverside, etc.), so you may see this cited as 2.2.115-118 or thereabouts depending on the source.” In effect, it built the spot-check into the answer. Gemini noted the variance too, less specifically.
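The edition drift can be checked with simple arithmetic. The sketch below is a minimal illustration, not a standard: the function name `within_edition_drift` and the default tolerance of two lines are assumptions made here; the line ranges are the ones quoted in this section.

```python
def within_edition_drift(cited, reference, tolerance=2):
    """Return True when both ends of a cited line range sit within
    `tolerance` lines of the reference edition's range.

    `tolerance` is a hypothetical threshold picked for illustration.
    """
    start_gap = abs(cited[0] - reference[0])
    end_gap = abs(cited[1] - reference[1])
    return start_gap <= tolerance and end_gap <= tolerance


chat_cite = (116, 119)    # lines cited by Claude and Gemini
folger_cite = (115, 118)  # lines in the Folger edition

print(within_edition_drift(chat_cite, folger_cite))  # True
```

A range that passes this check still needs the quote itself read against the text; the arithmetic only separates edition drift from a cite that points somewhere else entirely.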

ChatGPT did something else: it returned “One may smile, and smile, and be a villain” from Act 1, Scene 5. The line is real. The cite is plausible. But it’s a thematic stretch (the line’s about treachery), and the answer arrived with no hedge that this isn’t the canonical doubt quote a Shakespeare class would name. A reader pasting it into an essay on Hamlet’s doubt looks foolish at the first close-read. Verifying-every-claim means reading the quote for what it actually says, not just for whether it exists.

The verification moves

When the card says spot-check or harder, here are six moves that work inside any consumer chat:

Ask for the source. “Where did you get that?” If the model can’t cite a specific document or URL, treat the specific as a starting point for a search, not an answer.
Check whether search actually ran. A grounded response usually shows tool indicators (a “Searching…” line, source links under the answer, an embedded map widget). A confident answer with no markers is recall, and recall is where the specifics drift.
Paste the doc and ask about that. If you have the source, hand it to the chat. Grounded answers about a document in the conversation are roughly five-to-ten times more accurate than the same question asked cold.
Ask the model what it’s least sure of. “What in that answer are you least certain about?” Top models will often hand back exactly the specifics worth verifying. (For more on getting a model to confess uncertainty, see How to get AI to say “I don’t know”.)
Run it past a second model. Cross-checking ChatGPT against Claude or Gemini takes thirty seconds. Where they agree, confidence goes up; where they diverge, look closer. The Sandmeyer’s example earlier is the model for this: ChatGPT and Claude almost agreed, but Claude’s Saturday opening time was an hour off, and that hour was enough.
Rerun in a fresh chat. Long conversations drift. If a specific really matters, paste the same question into a new tab without the rest of the context.

None of these require a paid tier, an API key, or a prompt-engineering tutorial. They take seconds and they pay back the time on the first thing you catch.
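The second-model move amounts to a diff over the specifics. Here is a minimal sketch, assuming you have already copied each chat’s answer into a small dictionary by hand; the function name `diverging_fields` and the normalized hour strings are this sketch’s own, and the values come from the Sandmeyer’s example.

```python
def diverging_fields(a: dict, b: dict) -> dict:
    """Return the fields where two answers disagree, with both values.

    A field missing from one answer counts as a disagreement (None).
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Hours as each chat reported them, normalized by hand to one format.
chatgpt = {"saturday": "10am-5pm", "sunday": "10am-5pm"}
claude = {"saturday": "9am-5pm", "sunday": "10am-5pm"}

print(diverging_fields(chatgpt, claude))
# {'saturday': ('10am-5pm', '9am-5pm')}
```

Agreement between two chats is not proof, and a disagreement does not say which one is right. The diff only tells you which specific to check against the shop’s own site.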

Half-life

The cells on this card will move. Hallucination rates trend down quarter over quarter; consumer chats are getting better at admitting what they don’t know; search defaults shift the failure modes. The categories are more stable. Grounded beats ungrounded. Common beats niche. General info beats specific facts. Checkable beats vibey. Those will hold for a while.

If a task moves from your low-stakes column into your high-stakes one (the meal plan turns into a medication question; the casual restaurant pick turns into a reservation), revisit which column the card is putting you in. The card exists to point your skepticism at the places it pays, not to make you suspicious of everything. (For the underlying mechanics of why models fabricate confident specifics, see What’s actually happening when AI makes things up.)

An honest note before you go: while building the calibration card above, the first ground-truth pass got Sandmeyer’s hours wrong by trusting a Yelp listing instead of the shop’s own site. The fact-check that was supposed to catch that didn’t. Staging review did. The lesson and the failure are the same.

Drafted by Claude Opus 4.7 on May 22, 2026. Verified against live sources on May 22, 2026. If any of this has rotted, tell us.