30/05/2026

OCR's Secret Weapon: How a Domain Lexicon Upgrades Text Recognition

Third in our OCR series — after the practitioner's guide and the case study on confidence scoring.

An OCR engine looks at a shape and guesses which character it is. When the script is clean, the guess is easy. When it's faded Hebrew handwriting, a rabbinical abbreviation, or a place name that has vanished from the map, the guess collapses. This is where the least-discussed and most impactful tool comes in: the domain lexicon. This post explains what it is, how to build one, and how it turns a mediocre engine into an accurate system.

Why a good OCR engine isn't enough

Modern OCR engines ship with a general language model that helps them "guess right" from context. The problem: that general model was trained on everyday text, not on your material. It knows "shalom" is a common word — but it has never seen a rare surname, the nickname of an Eastern European shtetl, a halachic term, or a professional acronym. And it's precisely in those words — the rare, the unique — that the historical value lives, and precisely there that the general engine fails most.

A domain lexicon closes that gap. It hands the system a list of the words actually expected in the material — turning a blind guess into a choice among plausible candidates.

What goes into a domain lexicon

A domain lexicon isn't a "dictionary" in the ordinary sense. It's a focused collection of the linguistic units that characterize one specific archive:

  • First and family names — including spelling variants and transliteration corruptions.
  • Place names — towns, streets, districts, sometimes under historical names no longer in use.
  • Professional terms — halachic, medical, legal, engineering — depending on the collection type.
  • Abbreviations and acronyms — extremely common in Hebrew manuscripts and a frequent source of confusion for general engines.
  • Fixed patterns — Hebrew dates, opening and closing formulas, units of measure.

In our projects, a working lexicon runs from a few hundred to a few thousand entries. A lexicon of about 2,000 terms already covers most of the hard cases in a focused collection.

How to build a domain lexicon

You don't invent the lexicon — you collect it from sources that already exist:

  1. Existing indexes and lists — book indexes, membership rolls, previously transcribed community ledgers.
  2. Ground-truth transcription — the first few hundred lines you transcribed by hand already contain the core vocabulary.
  3. Expert knowledge — a historian or archivist who knows the collection knows which terms and names recur.
  4. External sources — name databases, geographic gazetteers, professional glossaries.

Once collected, you normalize: merge spellings, mark variants, and attach context to each entry (type, expected frequency). A good lexicon is alive — it grows as you transcribe more material.
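The normalization step can be sketched in a few lines of Python. This is a minimal illustration, not our production schema: the `LexiconEntry` fields, the `build_lexicon` and `variant_index` helpers, the sample rows, and the choice to strip Hebrew vowel points are all assumptions made for the example.

```python
from dataclasses import dataclass, field
import unicodedata


@dataclass
class LexiconEntry:
    canonical: str                                   # preferred spelling
    kind: str                                        # e.g. "surname", "place"
    variants: set[str] = field(default_factory=set)  # other attested spellings
    expected_freq: float = 0.0                       # rough prior (hypothetical scale)


def normalize(form: str) -> str:
    """Trim, apply Unicode NFC, and drop Hebrew points and cantillation marks."""
    text = unicodedata.normalize("NFC", form.strip())
    return "".join(
        ch for ch in text
        if not (unicodedata.category(ch) == "Mn" and "\u0591" <= ch <= "\u05c7")
    )


def build_lexicon(rows):
    """Merge (spelling, canonical, kind, freq) rows into one entry per canonical form."""
    entries: dict[str, LexiconEntry] = {}
    for spelling, canonical, kind, freq in rows:
        key = normalize(canonical)
        entry = entries.setdefault(key, LexiconEntry(key, kind))
        # Keep the highest frequency estimate seen for this entry.
        entry.expected_freq = max(entry.expected_freq, freq)
        variant = normalize(spelling)
        if variant != key:
            entry.variants.add(variant)
    return entries


def variant_index(entries):
    """Map every known spelling (canonical or variant) to its canonical form."""
    index = {}
    for key, entry in entries.items():
        index[key] = key
        for variant in entry.variants:
            index[variant] = key
    return index


# Illustrative rows only; real rows come from indexes, ledgers and gazetteers.
rows = [
    ("Horowitz", "Horowitz", "surname", 0.02),
    ("Hurwitz", "Horowitz", "surname", 0.01),
    ("Gurevich", "Horowitz", "surname", 0.005),
    ("Brody", "Brody", "place", 0.03),
]
lexicon = build_lexicon(rows)
index = variant_index(lexicon)
print(index["Hurwitz"])  # Horowitz
```

Keeping variants attached to a canonical entry, rather than as separate entries, is what lets later stages recognize a variant spelling as a known name instead of trying to "fix" it.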

How the lexicon fits the pipeline

The lexicon doesn't replace the engine — it's a layer on top of it, in three places:

  • Candidate re-ranking — the engine produces several possible readings per word; the lexicon prefers the one matching a known entry.
  • Post-OCR correction — a word very close (small edit distance) to a lexicon entry is corrected to it, carefully.
  • Confidence scoring — as explained in the previous case study, a lexicon match is ~40% of the confidence score. A word that exists in the lexicon is reinforced; a random string is penalized.
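Candidate re-ranking, the first of the three, might look like the sketch below. It assumes the engine exposes several readings per word with scores between 0 and 1; the `rerank` function, the additive `bonus` of 0.15, and the sample readings are hypothetical and would need tuning on held-out ground truth.

```python
def rerank(candidates, lexicon, bonus=0.15):
    """Sort candidate readings by engine score plus a bonus for lexicon hits.

    candidates: list of (reading, engine_score) pairs, scores in 0..1.
    lexicon:    set of known spellings.
    bonus:      hypothetical additive boost for readings found in the lexicon.
    """
    def adjusted(item):
        reading, score = item
        return score + (bonus if reading in lexicon else 0.0)

    return sorted(candidates, key=adjusted, reverse=True)


lexicon = {"Brody", "Lemberg"}

# A close call: the lexicon tips the balance toward the known place name.
close_call = [("Brodv", 0.48), ("Brody", 0.41), ("Brcdy", 0.11)]
print(rerank(close_call, lexicon)[0][0])  # Brody

# A confident engine reading is not overridden by a weak lexicon candidate.
confident = [("Brodt", 0.90), ("Brody", 0.08)]
print(rerank(confident, lexicon)[0][0])  # Brodt
```

The bonus is deliberately small relative to the score range, so it only changes the outcome when the engine itself is undecided.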

The critical distinction: the lexicon influences the probability; it doesn't force a result. Projects that blur this distinction are exactly the ones that fail.
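One way to picture "influence, not force" is as a weighted blend of signals. The sketch below takes the roughly 40% lexicon weight from the case study and, as a simplifying assumption, lumps everything else into a single visual score with the remaining 60%. The `confidence` function and its two inputs are illustrative names, not the actual scoring formula.

```python
LEXICON_WEIGHT = 0.4  # from the post: a lexicon match is roughly 40% of the score
VISUAL_WEIGHT = 0.6   # assumption: all other signals lumped into one visual score


def confidence(visual_score: float, lexicon_match: float) -> float:
    """Weighted blend of two signals, each expected in the range 0..1."""
    for name, value in (("visual_score", visual_score),
                        ("lexicon_match", lexicon_match)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return VISUAL_WEIGHT * visual_score + LEXICON_WEIGHT * lexicon_match


# The same mediocre visual score, with and without a lexicon match.
with_match = confidence(0.5, 1.0)
without_match = confidence(0.5, 0.0)
print(round(with_match, 2), round(without_match, 2))  # 0.7 0.3
```

Because the lexicon contributes less than half of the total, a match can lift an uncertain reading but cannot outvote strong visual evidence on its own.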

The trap: when a lexicon hurts

A domain lexicon is a double-edged sword. If you force every word to match its nearest entry, the system "corrects" correct text into an error — a phenomenon called over-correction. A real surname not in the lexicon gets wrongly swapped for a similar one that is. A lexicon meant to raise accuracy quietly lowers it instead.

The rule: the lexicon suggests, it doesn't decide. You auto-correct only when visual confidence is high and the match is very close. Everything else goes to a human review queue instead of being force-corrected. Flagging "unsure" beats inventing an answer.
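The rule can be sketched as a small gate in front of the correction step. This is an illustration under stated assumptions: `visual_conf` is taken to be the engine's visual score for the word, the thresholds `min_conf=0.9` and `max_dist=1` are hypothetical, and the extra refusal to correct when two entries are equally close is our own addition for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution, or free on a match
            ))
        prev = curr
    return prev[-1]


def decide(word, visual_conf, lexicon, min_conf=0.9, max_dist=1):
    """Return (action, text); action is 'keep', 'correct' or 'review'."""
    if word in lexicon:
        return "keep", word
    if not lexicon:
        return "review", word
    # Rank entries by distance; a tie for first place is treated as ambiguous.
    scored = sorted((edit_distance(word, entry), entry) for entry in lexicon)
    best_dist, best = scored[0]
    tied = len(scored) > 1 and scored[1][0] == best_dist
    if visual_conf >= min_conf and best_dist <= max_dist and not tied:
        return "correct", best
    return "review", word  # flag as unsure instead of inventing an answer


lexicon = {"Brody", "Lemberg", "Tarnopol"}
print(decide("Brodv", 0.95, lexicon))    # ('correct', 'Brody')
print(decide("Brodv", 0.60, lexicon))    # ('review', 'Brodv')
print(decide("Brodski", 0.95, lexicon))  # ('review', 'Brodski')
```

The last call is the over-correction case: "Brodski" is three edits away from "Brody", so it is left untouched and routed to a human rather than swapped for the nearest entry.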

How much it really helps

On clean material the effect is modest, because the engine manages on its own. On the hard material, where the value hides, the effect is dramatic. On a typical Hebrew handwriting collection, adding a focused domain lexicon improved our exact-match rate on the critical words by tens of percent, sometimes nearly doubling it, without retraining the engine and without a single extra image. It's the best cost-to-benefit ratio in the entire pipeline.

The reason is simple: training a model demands GPUs and time; a lexicon demands a list. And in most historical collections that list already partly exists. It just needs collecting.
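If you want to measure the effect on your own material, the exact-match rate on critical words is easy to compute. The sketch below assumes the OCR output has already been aligned word by word with a ground-truth transcription; the `exact_match_rate` function is a hypothetical helper, and the word pairs are made-up toy data that only exercise the arithmetic, not measurements from any project.

```python
def exact_match_rate(pairs, critical):
    """Share of critical ground-truth words that the OCR reproduced exactly.

    pairs:    list of (truth, ocr) word pairs, already aligned.
    critical: set of ground-truth words that matter (names, places, terms).
    """
    relevant = [(truth, ocr) for truth, ocr in pairs if truth in critical]
    if not relevant:
        raise ValueError("no critical words in the sample")
    hits = sum(truth == ocr for truth, ocr in relevant)
    return hits / len(relevant)


critical = {"Brody", "Horowitz", "Lemberg", "Tarnopol"}

# Toy data: the same five words, OCR'd without and with a lexicon layer.
without_lexicon = [("Brody", "Brodv"), ("Horowitz", "Horowitz"), ("the", "the"),
                   ("Lemberg", "Lernberg"), ("Tarnopol", "Tarnopol")]
with_lexicon = [("Brody", "Brody"), ("Horowitz", "Horowitz"), ("the", "the"),
                ("Lemberg", "Lernberg"), ("Tarnopol", "Tarnopol")]

print(exact_match_rate(without_lexicon, critical))  # 0.5
print(exact_match_rate(with_lexicon, critical))     # 0.75
```

Restricting the metric to critical words matters: common words like "the" are almost always right and would otherwise hide the difference.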

Where to go from here

If you have a collection with a unique vocabulary — names, places, professional terms — you very likely already hold half a lexicon without knowing it. In a short scoping call we'll examine the material, identify which sources can become a lexicon, and estimate how much accuracy it would add to your project — before you spend a shekel on training a model.