From Archive to Reader: A Historical Newspapers Research Project with AI
One of our recent projects shows what classical historical methodology can achieve when combined with modern AI: tracking one historical figure's presence across decades of the early Hebrew-language press.
The Challenge: A Needle in a Haystack of Crumbling Paper
The client — a researcher writing a biography of a public figure from the pre-state Yishuv era — needed to gather every mention of the subject across Hebrew newspapers between 1910 and 1948. The potential sources included:
- HaTzvi and Herzl-era papers
- Doar HaYom, edited by Itamar Ben-Avi
- Haaretz from the 1920s and 30s
- Davar, the Histadrut organ
- HaBoker, HaTzofe, Yediot Aharonot and more
Total material: thousands of issues, tens of thousands of pages. Manual search would have taken months, perhaps years.
Stage One: OCR Quality for Historical Type
National digitization projects exist (such as the National Library of Israel's "Historical Jewish Press"), but their OCR quality is limited, especially for:
- Faded ink on crumbling issues
- Complex page layouts with narrow columns, advertisements, mixed typefaces
- Proper names and foreign transliterations, where OCR engines typically fail
- Censored sections from the British Mandate era, where redaction marks confuse the recognizer
Our solution: we ran a supplementary custom OCR pass on the relevant issues, using a model trained on a corpus of Hebrew press from the same period. Accuracy on proper names jumped from 72% to approximately 94%.
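To make the accuracy figure concrete, here is a minimal sketch of how proper-name accuracy could be measured against a hand-checked sample. The function name, the exact-match criterion, and the sample names are assumptions for illustration; the project's actual evaluation procedure is not described here.

```python
def name_accuracy(gold_names, ocr_names):
    """Share of hand-checked names that the OCR reproduced exactly."""
    if len(gold_names) != len(ocr_names):
        raise ValueError("gold and OCR lists must be aligned one-to-one")
    if not gold_names:
        return 0.0
    correct = sum(1 for gold, ocr in zip(gold_names, ocr_names) if gold == ocr)
    return correct / len(gold_names)


# Hypothetical sample: each verified name paired with the OCR reading of the same spot.
gold_sample = ["בן-אבי", "ויצמן", "אוסישקין", "רופין"]
ocr_sample = ["בן-אבי", "ויצמן", "אוסישקיו", "רופין"]  # final nun misread as vav

sample_score = name_accuracy(gold_sample, ocr_sample)
print(f"Proper-name accuracy on the sample: {sample_score:.0%}")
```

Running the same measurement before and after the custom OCR pass, on the same sample, is what makes a before/after comparison meaningful.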
Stage Two: Building a Smart Search Engine with RAG
Raw OCR text is not enough. A person's name can appear in dozens of variants: initials, nicknames, plene (full) or defective spellings, typos. Simple boolean search is also insufficient, because we want to understand the context of every mention.
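One way to catch spelling variants and OCR typos is approximate string matching. The sketch below uses a plain Levenshtein edit distance against a list of known variants; the function names, the one-edit threshold, and the sample variants are assumptions for illustration, not the matcher used in the project.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def find_variant(token, variants, max_distance=1):
    """Return the closest known variant, or None if nothing is close enough."""
    best = min(variants, key=lambda v: edit_distance(token, v), default=None)
    if best is None or edit_distance(token, best) > max_distance:
        return None
    return best


# Hypothetical variant list: defective spelling, plene spelling, initial plus surname.
known_variants = ["ויצמן", "וייצמן", "ח. ויצמן"]

typo_match = find_variant("ויצמו", known_variants)   # OCR typo in the last letter
unrelated_match = find_variant("הרצל", known_variants)  # a different name entirely
```

A fuzzy match like this only produces candidates. Deciding whether a candidate really refers to the subject still requires reading the surrounding context.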
The system we built:
- A vector database of all relevant paragraphs with half-page context in each direction
- A large language model agent that reads each "hit" and classifies it: is this the same person? Is this a passing mention or a full article about them?
- Automatic cross-referencing with other sources: personal diaries, minutes, letters — to verify identity based on dates and events
- Precise source citation: every result comes back with a direct link to the original scan, including issue number and page
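The components above can be sketched roughly as follows: each paragraph is stored with its neighbouring context and a citation, and a prompt is assembled for the classification step. This is a sketch under stated assumptions: the class and function names are hypothetical, a paragraph count stands in for the half-page window, the scan URL and issue identifier are placeholders, and the embedding, vector store, and model call are left out because the source does not name them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    newspaper: str
    issue: str
    page: int
    scan_url: str  # placeholder link to the original scan


@dataclass(frozen=True)
class Chunk:
    text: str      # the paragraph itself
    context: str   # the paragraph plus its neighbours
    citation: Citation


def build_chunks(paragraphs, citation, window=2):
    """Pair every paragraph with `window` neighbours on each side."""
    chunks = []
    for i, paragraph in enumerate(paragraphs):
        start = max(0, i - window)
        end = min(len(paragraphs), i + window + 1)
        context = "\n".join(paragraphs[start:end])
        chunks.append(Chunk(text=paragraph, context=context, citation=citation))
    return chunks


def build_prompt(chunk, subject):
    """Assemble the two classification questions for a language model."""
    c = chunk.citation
    return (
        f"Subject of the biography: {subject}\n"
        f"Source: {c.newspaper}, issue {c.issue}, page {c.page}\n\n"
        f"Passage:\n{chunk.text}\n\n"
        f"Surrounding context:\n{chunk.context}\n\n"
        "1. Does the passage refer to the subject? Answer yes, no, or unsure.\n"
        "2. Is it a passing mention or a full article about the subject?"
    )


# Hypothetical page: five placeholder paragraphs from one scanned page.
sample_citation = Citation("Davar", "issue-id-placeholder", 3,
                           "https://example.org/scans/placeholder#page=3")
sample_chunks = build_chunks(["p0", "p1", "p2", "p3", "p4"], sample_citation, window=1)
sample_prompt = build_prompt(sample_chunks[2], "the subject's name")
```

Because the citation travels with every chunk, any answer the model gives can be traced back to a specific issue and page.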
The Results
Within three weeks — instead of a year and a half of manual search — we found:
- Over 400 direct mentions of the subject in contemporary press
- 17 full interviews and feature articles that were previously unknown, including a few buried in private collections
- Two public controversies she was involved in — visible only through cross-reading multiple papers in the same week
- A list of 30+ notable names who corresponded with her — identified by co-mentions in the press
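The last two findings rest on simple aggregations over the verified mentions: grouping by publication week to spot stories covered by several papers at once, and counting which names appear alongside the subject. A minimal sketch follows, assuming each mention is a record with a date, a paper, and a list of names. The record layout, function names, and sample records are invented for illustration and are not project data.

```python
from collections import Counter, defaultdict
from datetime import date


def papers_by_week(mentions):
    """Map each ISO (year, week) to the set of papers that mention the subject."""
    weeks = defaultdict(set)
    for mention in mentions:
        iso = mention["date"].isocalendar()
        weeks[(iso[0], iso[1])].add(mention["paper"])
    return weeks


def busy_weeks(mentions, min_papers=2):
    """Weeks in which at least `min_papers` different papers carried a mention."""
    grouped = papers_by_week(mentions)
    return sorted(week for week, papers in grouped.items() if len(papers) >= min_papers)


def co_mentions(mentions, subject):
    """Count how often each other name appears in the same item as the subject."""
    counts = Counter()
    for mention in mentions:
        names = set(mention["names"])
        if subject in names:
            counts.update(names - {subject})
    return counts


# Made-up sample records with placeholder names.
sample_mentions = [
    {"date": date(1931, 3, 3), "paper": "Davar", "names": ["Subject", "Person A"]},
    {"date": date(1931, 3, 5), "paper": "Haaretz", "names": ["Subject", "Person A", "Person B"]},
    {"date": date(1931, 6, 10), "paper": "Doar HaYom", "names": ["Person B"]},
]

cross_paper_weeks = busy_weeks(sample_mentions)
frequent_names = co_mentions(sample_mentions, "Subject")
```

Weeks flagged this way are only leads. A researcher still has to read the items side by side to decide whether they describe the same event.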
What We Learned
This project taught us several principles we now apply to nearly every engagement:
- Quality OCR is the foundation for everything — there's no substitute for training on a period-relevant corpus
- Contextual reading beats keyword search: a modern LLM judges each mention in its context, much as a human researcher would, only far faster
- Source transparency is critical — every finding must be verifiable back to an original scan
- The human researcher is irreplaceable — AI accelerates collection, but historical interpretation remains ours
Right for You?
If you have a research project that involves large quantities of historical press, periodicals or journals, we're here to turn "impossible" into "three weeks of work."
