When the Model Isn't Sure: Confidence Scoring and Calibration in Handwriting OCR — A Case Study
A practical follow-up to our OCR guide, drawn from a project that wrapped this month.
In our OCR guide we wrote that "a vendor reporting 96% accuracy without showing you the test set is bluffing." This post is the other side of that coin: what actually sits behind an accuracy number, and why a calibrated confidence score is worth more than a high but blind accuracy figure. The case study is automated reading of handwritten Hebrew field forms.
The problem: not every detection is equal
The starting point was a set of 32 scanned images, each with Hebrew handwriting mixed with numbers, abbreviations and domain-specific terms. A single vision model returned 158 detections. At the macro level, "78% accuracy" sounds reasonable. But 78% accuracy across 158 items means 35 errors scattered through the set — and if you don't know which 35, you have to manually check all 158. The global accuracy saved nothing in review effort.
The client doesn't need "78% accuracy." They need to know which items can be trusted without checking and which require a human eye. That's not a question of average accuracy — it's a question of calibrated, per-item confidence.
The principle: dual-blind reading
The first move was to stop relying on a single model. We ran the same image through two independent vision models — one from the Gemini family, one from the Claude family — without either seeing the other's output. We call this dual-blind reading.
The idea is simple and powerful: when two models trained on different data, with different architectures, read the same scrawl and arrive at the same result, that agreement is an independent, strong signal of correctness. When they disagree, you get an automatic red flag without a human touching the page. Having two independent readers transcribe the same page and then comparing the results is what archivists have done by hand for a century; now it runs at scale.
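As a rough illustration of the comparison step, here is a minimal sketch in Python. It assumes each model returns a plain string per field. The function names `edit_distance` and `compare_readings` are hypothetical, not taken from the project's code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def compare_readings(reading_a: str, reading_b: str) -> dict:
    """Compare two transcriptions of the same field that were produced
    independently. Surrounding whitespace is ignored (an assumption)."""
    a, b = reading_a.strip(), reading_b.strip()
    distance = edit_distance(a, b)
    return {"distance": distance, "agree": distance == 0}
```

A disagreement, meaning any distance above zero, is the automatic red flag. How much normalization to apply before comparing (whitespace, punctuation, digit forms) is a design choice that depends on the forms.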
The confidence formula
Agreement alone isn't enough — two models can be wrong in the same way. So we built a composite confidence score per detection from three independent signals:
- Vision confidence (≈50%) — how sure each model is in the visual reading of the character itself.
- Dictionary support (≈40%) — whether the word matches a structured vocabulary (here, ~2,200 domain terms). A word that exists in the professional lexicon is reinforced; a random string of letters is penalized.
- Agreement bonus — a boost when the dual-blind reading converges on the same result.
The output isn't "right/wrong" but a continuous score, letting us bucket every detection into one of three tiers: high confidence (≥80%), medium, and low. In our set this split into 83 high-confidence items, 52 medium and 29 low.
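The scoring and tiering described above might be sketched as follows. This is an illustration under stated assumptions, not the project's actual formula: the size of the agreement bonus (0.10) and the medium/low cutoff (0.50) are not given in the text and are placeholders, and dictionary support is simplified to a yes/no match with no explicit penalty term.

```python
from dataclasses import dataclass

# Weights follow the approximate split described in the text.
W_VISION = 0.50
W_DICTIONARY = 0.40
AGREEMENT_BONUS = 0.10  # assumption: the bonus size is not stated

HIGH_CUTOFF = 0.80      # from the text: high confidence is >= 80%
MEDIUM_CUTOFF = 0.50    # assumption: the medium/low boundary is not stated


@dataclass
class Detection:
    text: str
    vision_conf: float   # 0..1, the models' confidence in the visual reading
    in_dictionary: bool  # does the word match the domain vocabulary?
    models_agree: bool   # did the dual-blind reading converge?


def composite_score(d: Detection) -> float:
    """Weighted sum of the three signals, capped at 1.0."""
    score = W_VISION * d.vision_conf
    score += W_DICTIONARY * (1.0 if d.in_dictionary else 0.0)
    if d.models_agree:
        score += AGREEMENT_BONUS
    return min(score, 1.0)


def tier(score: float) -> str:
    """Bucket a continuous score into one of the three tiers."""
    if score >= HIGH_CUTOFF:
        return "high"
    if score >= MEDIUM_CUTOFF:
        return "medium"
    return "low"
```

With these placeholder weights, a word outside the dictionary can never reach the high tier on its own, which is one way to read "a random string of letters is penalized".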
The part nobody does: calibration
This is where most projects stop — they build a confidence score and assume it means something. It doesn't, until you calibrate it against ground truth.
Calibration means: take a manually transcribed set, and ask — of the items the model flagged "high confidence," how many are actually correct? When we checked, the initial "certain" category was right only 56% of the time. Nearly half of what the system declared "safe" was wrong. An uncalibrated confidence score is worse than no score — because it misleads.
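The check itself is small. A minimal sketch, assuming you have each detection's tier, the system's reading and the manual transcription side by side; the function name and the sample rows are invented for illustration:

```python
from collections import defaultdict


def precision_by_tier(items):
    """items: iterable of (tier, predicted_text, ground_truth_text).
    Returns, for each tier, the fraction of items read correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for tier_name, predicted, truth in items:
        total[tier_name] += 1
        if predicted == truth:
            correct[tier_name] += 1
    return {name: correct[name] / total[name] for name in total}


# Invented sample rows, not project data.
sample = [
    ("high", "מים", "מים"),
    ("high", "125", "126"),
    ("low", "קו", "קו"),
]
report = precision_by_tier(sample)
```

If the figure for the high tier comes back far below what the label implies, as it did here at 56%, the label is not yet safe to act on.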
The calibration itself didn't require training a new model. It required finding the right threshold: at what confidence level, and under what conditions, does the "certain" category actually hold its promise? Two tightenings made the difference:
- Requiring full agreement — only items both models read identically (edit distance zero) enter "certain."
- Raising the vision threshold — filtering out items where even a single model hesitated.
With the calibrated threshold, precision in the "certain" category rose from 56% to 90%, and an intermediate track reached 84%. We didn't change the model. We only changed the definition of "certain", which turned a meaningless number into one you can build a workflow on.
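One way to search for such a threshold is to sweep candidate values against the manually transcribed set and keep the lowest one at which the promise holds. The sketch below assumes each labeled item records whether the two models agreed exactly, the lower of the two vision confidences, and whether the reading was correct. The field names, function names and sample values are hypothetical.

```python
def certain_precision(items, vision_threshold):
    """Precision and size of the 'certain' set under the tightened rule:
    full agreement AND the weaker model's confidence >= threshold."""
    selected = [it for it in items
                if it["agree"] and it["min_vision"] >= vision_threshold]
    if not selected:
        return None, 0
    hits = sum(1 for it in selected if it["correct"])
    return hits / len(selected), len(selected)


def lowest_threshold_meeting(items, target_precision, candidates):
    """Return (threshold, precision, n_items) for the lowest candidate
    threshold whose 'certain' set meets the target, or None."""
    for threshold in sorted(candidates):
        precision, n = certain_precision(items, threshold)
        if precision is not None and precision >= target_precision:
            return threshold, precision, n
    return None


# Invented labeled items, not project data.
labeled = [
    {"agree": True, "min_vision": 0.95, "correct": True},
    {"agree": True, "min_vision": 0.90, "correct": True},
    {"agree": True, "min_vision": 0.70, "correct": False},
    {"agree": True, "min_vision": 0.60, "correct": True},
    {"agree": False, "min_vision": 0.99, "correct": True},
]
result = lowest_threshold_meeting(labeled, 0.90, [0.5, 0.6, 0.7, 0.8, 0.9])
```

Taking the lowest passing threshold keeps as many items as possible in the "certain" set. On a small labeled set the precision estimate is noisy, so it is worth confirming the chosen threshold on items that were not used to pick it.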
Why this matters to the client
The practical difference is in human review effort:
| Approach | What the client gets | Manual review required |
|---|---|---|
| Single model, 78% global accuracy | One number, no idea where the errors are | All 158 items |
| Calibrated confidence, "certain" tier at 90% precision | 3 tiers prioritized by confidence | Mainly the 29 low-confidence items |
This turns a project from "check everything because you can't tell" into "check the 18% the system itself flags as doubtful." That's the difference between AI that creates work and AI that saves work.
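The arithmetic behind these figures is easy to reproduce from the counts quoted in this post. The variable names are ours:

```python
total_detections = 158
global_accuracy = 0.78
low_confidence_items = 29

# Errors implied by the global accuracy figure, with no indication of where they are.
expected_errors = round(total_detections * (1 - global_accuracy))

# Share of detections the system itself flags as doubtful.
review_share = low_confidence_items / total_detections

print(expected_errors)        # 35
print(f"{review_share:.0%}")  # 18%
```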
Three lessons that transfer to any OCR project
- Global accuracy is a marketing metric, not an operational one. What actually drives cost is how many pages a human must review — and that depends on calibrated confidence, not the average percentage.
- Two independent weak readers can be worth more than one strong reader. Agreement is a free signal. Disagreement is a free red flag. Use both.
- A confidence score without calibration is decoration. The only rule that matters: before you promise a client that "certain" means something, prove it against a manually transcribed set — and move the threshold until the promise holds.
Where to go from here
If you have a pile of documents and the question isn't only "can AI read this" but "which detections can we trust without checking each one" — that's exactly the work we do. A 30-minute scoping call, a representative set, and transparent calibration against ground truth — and together we'll know whether, and at what scale, it works for you.
