10/04/2026

From Archive to Reader: A Historical Newspapers Research Project with AI

One of our recent projects shows what combining classical historical methodology with modern AI technology can do: tracking a specific historical figure's presence across decades of the early Hebrew-language press.

The Challenge: A Needle in a Haystack of Crumbling Paper

The client — a researcher writing a biography of a public figure from the pre-state Yishuv era — needed to gather every mention of the subject across Hebrew newspapers between 1910 and 1948. The potential sources included:

  • HaTzvi and Herzl-era papers
  • Doar HaYom, edited by Itamar Ben-Avi
  • Haaretz from the 1920s and 1930s
  • Davar, the Histadrut organ
  • HaBoker, HaTzofe, Yediot Aharonot and more

Total material: thousands of issues, tens of thousands of pages. Manual search would have taken months, perhaps years.

Stage One: OCR Quality for Historical Type

National digitization projects exist (such as the National Library of Israel's "Historical Jewish Press" collection), but their OCR quality is limited, especially for:

  • Faded ink on crumbling issues
  • Complex page layouts with narrow columns, advertisements, mixed typefaces
  • Proper names and foreign transliterations, where OCR engines typically fail
  • Censored sections from the British Mandate era, where redaction marks confuse the recognizer

Our solution: we ran a supplementary custom OCR pass on the relevant issues, using a model trained on a corpus of Hebrew press from the same period. Accuracy on proper names rose from 72% to approximately 94%.
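A figure like "accuracy on proper names" is only meaningful against a hand-checked sample. The sketch below shows one simple way such a number could be computed, counting a name as correct only when the OCR output matches the verified reading exactly. It is an illustration, not the evaluation used in the project; the function name and the transliterated sample names are hypothetical. Note that a move from 72% to about 94% means the share of misread names falls from 28% to about 6%.

```python
def name_accuracy(pairs):
    """Share of proper names the OCR reproduced exactly.

    pairs: list of (verified_name, ocr_output_name) tuples.
    """
    if not pairs:
        raise ValueError("need at least one labelled name")
    correct = sum(1 for truth, ocr in pairs if truth == ocr)
    return correct / len(pairs)


# Hypothetical hand-labelled sample (transliterated for readability).
sample = [
    ("Shapira", "Shapira"),
    ("Levin", "Levin"),
    ("Rubin", "Rub1n"),      # OCR confused "i" with "1"
    ("Goldberg", "Goldberg"),
]

accuracy = name_accuracy(sample)
print(f"{accuracy:.0%}")  # prints "75%"
```

Exact-match scoring is deliberately strict: for a name search, a single wrong letter is enough to miss the mention entirely.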

Stage Two: Building a Smart Search Engine with RAG (Retrieval-Augmented Generation)

Raw OCR text is not enough. A person's name can appear in dozens of variants: initials, nicknames, plene (full) or defective spellings, typos. Simple Boolean search is also insufficient, because we want to understand the context of every mention.
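To illustrate why exact string matching falls short, here is a minimal sketch of variant-tolerant name matching using only the Python standard library. It is an assumption-laden illustration rather than the matcher used in the project: the name is hypothetical, the 0.8 similarity threshold is arbitrary, and a real pipeline would need a curated variant list and more careful handling of hyphenated names.

```python
import unicodedata
from difflib import SequenceMatcher

# Map Hebrew final-form letters to their regular forms so that
# comparisons are not thrown off by word-final letter shapes.
FINAL_FORMS = str.maketrans({
    "\u05da": "\u05db",  # final kaf   -> kaf
    "\u05dd": "\u05de",  # final mem   -> mem
    "\u05df": "\u05e0",  # final nun   -> nun
    "\u05e3": "\u05e4",  # final pe    -> pe
    "\u05e5": "\u05e6",  # final tsadi -> tsadi
})


def normalize(text):
    """Drop vowel points and punctuation, then unify final letters."""
    decomposed = unicodedata.normalize("NFD", text)
    # isalpha() is False for combining marks (niqqud) and punctuation.
    kept = [ch for ch in decomposed if ch.isalpha() or ch.isspace()]
    return "".join(kept).translate(FINAL_FORMS).strip()


def is_variant(candidate, known_variants, threshold=0.8):
    """True if candidate is close enough to any known spelling."""
    cand = normalize(candidate)
    return any(
        SequenceMatcher(None, cand, normalize(v)).ratio() >= threshold
        for v in known_variants
    )


# Hypothetical subject name with one alternate spelling.
known = ["רבקה שפירא", "רבקה שפירה"]

print(is_variant("רבקה שפידא", known))  # OCR read resh as dalet: True
print(is_variant("משה לוין", known))    # a different person: False
```

Character-level similarity catches OCR slips and spelling variants, but it cannot tell two people with the same name apart. That is why string matching alone is not enough and each hit still needs to be read in context.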

The system we built:

  1. A vector database of all relevant paragraphs, each stored with half a page of context on either side
  2. A large language model agent that reads each "hit" and classifies it: is this the same person? Is this a passing mention or a full article about them?
  3. Automatic cross-referencing with other sources: personal diaries, minutes, letters — to verify identity based on dates and events
  4. Precise source citation: every result comes back with a direct link to the original scan, including issue number and page
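The retrieval and citation steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the system described: character-trigram counts stand in for a real embedding model, a plain list stands in for the vector database, and the Passage fields, sample texts, and example.org URLs are all hypothetical. The LLM classification and cross-referencing steps are omitted.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    paper: str      # newspaper title
    issue: str      # issue number as printed
    page: int
    text: str       # paragraph plus surrounding context
    scan_url: str   # link back to the original scan (hypothetical)


def trigram_vector(text):
    """Toy stand-in for an embedding: counts of character trigrams."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, passages, top_k=3):
    """Return the top_k (score, passage) pairs, best match first."""
    q = trigram_vector(query)
    scored = [(cosine(q, trigram_vector(p.text)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def cite(p):
    """Format a result so it can be checked against the scan."""
    return f"{p.paper}, issue {p.issue}, p. {p.page} <{p.scan_url}>"


passages = [
    Passage("Davar", "1742", 3,
            "Rivka Shapira delivered a lecture in Jaffa.",
            "https://example.org/scans/davar-1742-p3"),
    Passage("Haaretz", "3310", 1,
            "Wheat prices rose at the Haifa port.",
            "https://example.org/scans/haaretz-3310-p1"),
]

results = search("Rivka Shapira lecture", passages, top_k=2)
best_score, best = results[0]
print(cite(best))
```

Keeping the issue, page, and scan link attached to every passage from the start means a citation never has to be reconstructed after the fact.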

The Results

Within three weeks, instead of an estimated year and a half of manual search, we found:

  • Over 400 direct mentions of the subject in contemporary press
  • 17 full interviews and feature articles that were previously unknown, including a few buried in private collections
  • Two public controversies she was involved in — visible only through cross-reading multiple papers in the same week
  • A list of 30+ notable names who corresponded with her — identified by co-mentions in the press
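The last two findings rest on simple aggregation once mentions have been extracted. As a minimal sketch, assuming each mention carries its paper and publication date and each article carries the names found in it, the grouping might look like this. The function names, dates, and people are hypothetical, and grouping by ISO calendar week is just one reasonable reading of "the same week".

```python
from collections import Counter, defaultdict
from datetime import date


def weeks_with_multiple_papers(mentions, min_papers=2):
    """Group mentions by ISO week; keep weeks covered by several papers.

    mentions: iterable of (paper, publication_date) pairs.
    """
    papers_by_week = defaultdict(set)
    for paper, day in mentions:
        iso = day.isocalendar()           # (ISO year, ISO week, weekday)
        papers_by_week[(iso[0], iso[1])].add(paper)
    return {
        week: sorted(papers)
        for week, papers in papers_by_week.items()
        if len(papers) >= min_papers
    }


def co_mention_counts(articles, subject):
    """Count how many articles name each person alongside the subject.

    articles: iterable of name lists, one list per article.
    """
    counts = Counter()
    for names in articles:
        if subject in names:
            counts.update(set(names) - {subject})
    return counts


# Hypothetical extracted data.
mentions = [
    ("Davar", date(1931, 3, 3)),
    ("Haaretz", date(1931, 3, 5)),
    ("Doar HaYom", date(1931, 3, 12)),
]
articles = [
    ["Rivka Shapira", "Moshe Levin"],
    ["Rivka Shapira", "Moshe Levin", "Sara Rubin"],
    ["Sara Rubin"],
]

print(weeks_with_multiple_papers(mentions))
# {(1931, 10): ['Davar', 'Haaretz']}
print(co_mention_counts(articles, "Rivka Shapira").most_common())
# [('Moshe Levin', 2), ('Sara Rubin', 1)]
```

Weeks flagged this way and frequent co-mentions are only leads. Whether a cluster of coverage is a controversy, or a co-mention reflects a real relationship, still has to be confirmed by reading the sources.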

What We Learned

This project taught us several principles we now apply to nearly every engagement:

  1. Quality OCR is the foundation for everything — there's no substitute for training on a period-relevant corpus
  2. Natural language beats keyword search: a modern LLM can read each mention in context, much as a human researcher would, only far faster
  3. Source transparency is critical — every finding must be verifiable back to an original scan
  4. The human researcher is irreplaceable — AI accelerates collection, but historical interpretation remains ours

Is This Right for You?

If you have a research project that involves large quantities of historical press, periodicals or journals, we're here to turn "impossible" into "three weeks of work."

Talk to us about your press research project