From Archive to Reader: A Historical Newspapers Research Project with AI

One of our recent projects shows what classical historical methodology and modern AI can achieve together: tracking a specific historical figure's presence across decades of early Hebrew-language press.

The Challenge: A Needle in a Haystack of Crumbling Paper

The client — a researcher writing a biography of a public figure from the pre-state Yishuv era — needed to gather every mention of the subject across Hebrew newspapers between 1910 and 1948. The potential sources included:

HaTzvi and Herzl-era papers
Doar HaYom edited by Itamar Ben-Avi
Haaretz from the 1920s and 30s
Davar, the Histadrut organ
HaBoker, HaTzofe, Yediot Aharonot and more

Total material: thousands of issues, tens of thousands of pages. Manual search would have taken months, perhaps years.

Stage One: OCR Quality for Historical Type

National digitization projects exist (such as the National Library of Israel's "Historical Jewish Press" collection), but their OCR quality is limited, especially for:

Faded ink on crumbling issues
Complex page layouts with narrow columns, advertisements, mixed typefaces
Proper names and foreign transliterations, where OCR engines typically fail
Censored sections from the British Mandate era, where redaction marks confuse the recognizer

Our solution: we ran a supplementary OCR pass on the relevant issues, using a custom model trained on a corpus of Hebrew press from the same period. Accuracy on proper names rose from 72% to approximately 94%.
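To make the accuracy figures concrete, here is a minimal sketch of how name-level accuracy could be measured against a hand-verified list, and what the improvement means in terms of errors removed. The function names and the exact-substring matching rule are assumptions for illustration, not a description of the evaluation actually used in the project.

```python
def name_accuracy(gold_names: list[str], ocr_text: str) -> float:
    """Share of hand-verified names that appear verbatim in the OCR output.

    Assumption: a name counts as recognized only on an exact substring match.
    """
    if not gold_names:
        raise ValueError("gold_names must not be empty")
    found = sum(1 for name in gold_names if name in ocr_text)
    return found / len(gold_names)


def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of the original errors that the improved pass eliminated."""
    err_before = 1.0 - acc_before
    if err_before == 0:
        raise ValueError("no errors to reduce")
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before
```

Read this way, moving from 72% to 94% accuracy means the error rate on proper names drops from 28% to about 6%, so roughly four out of five of the original name errors disappear.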

Stage Two: Building a Smart Search Engine with RAG

Raw OCR text is not enough. A person's name can appear in dozens of variants: initials, nicknames, full or defective spellings, typos. Simple boolean search misses many of these, and it tells us nothing about the context of each mention. So we built a Retrieval-Augmented Generation (RAG) pipeline.
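The name-variant problem can be sketched in a few lines. The example below is an illustrative assumption, not the project's matcher: it works on Latin transliterations, uses Python's standard difflib for fuzzy comparison, and picks a 0.8 similarity threshold arbitrarily. Real Hebrew text would also need rules for full versus defective spelling.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, turn punctuation into spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())


def matches_initials(candidate: str, canonical: str) -> bool:
    """True when each token equals the canonical token or is its first letter."""
    c_tokens = normalize(candidate).split()
    k_tokens = normalize(canonical).split()
    if len(c_tokens) != len(k_tokens):
        return False
    return all(
        c == k or (len(c) == 1 and k.startswith(c))
        for c, k in zip(c_tokens, k_tokens)
    )


def is_variant(candidate: str, canonical: str, threshold: float = 0.8) -> bool:
    """Accept initials, or spellings close enough to absorb typos and OCR slips."""
    if matches_initials(candidate, canonical):
        return True
    ratio = SequenceMatcher(None, normalize(candidate), normalize(canonical)).ratio()
    return ratio >= threshold
```

With these rules, "I. Ben-Avi" and "Itamar Ben Avy" both match "Itamar Ben-Avi", while "David Ben-Gurion" does not. A matcher like this only produces candidates; deciding whether a candidate really is the same person still requires reading the context.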

The system we built:

A vector database of all relevant paragraphs, each stored with half a page of context on either side
A large language model agent that reads each "hit" and classifies it: is this the same person? Is this a passing mention or a full article about them?
Automatic cross-referencing with other sources: personal diaries, minutes, letters — to verify identity based on dates and events
Precise source citation: every result comes back with a direct link to the original scan, including issue number and page
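The components above can be sketched as a toy pipeline. Everything here is a simplified assumption: the bag-of-words "embedding" stands in for a neural embedding model and a vector database, classify_hit is a placeholder for the LLM agent, and the Passage fields (including scan_url) are hypothetical names chosen for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    text: str
    newspaper: str
    issue: str      # issue identifier, kept for citation
    page: int
    scan_url: str   # hypothetical link back to the original scan


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use a neural model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, passages: list[Passage], top_k: int = 3):
    """Return up to top_k (score, passage) pairs, best match first."""
    q = embed(query)
    scored = [(cosine(q, embed(p.text)), p) for p in passages]
    scored = [(s, p) for s, p in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def classify_hit(passage: Passage, subject: str) -> str:
    """Placeholder for the LLM step: a crude count-based stand-in."""
    mentions = passage.text.lower().count(subject.lower())
    if mentions == 0:
        return "no mention"
    return "full article" if mentions >= 3 else "passing mention"
```

Because every Passage carries its newspaper, issue, page and scan link, each result returned by retrieve can be traced back to the original page without a separate lookup.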

The Results

Within three weeks — instead of a year and a half of manual search — we found:

Over 400 direct mentions of the subject in contemporary press
17 full interviews and feature articles that were previously unknown, including a few buried in private collections
Two public controversies she was involved in — visible only through cross-reading multiple papers in the same week
A list of 30+ notable names who corresponded with her — identified by co-mentions in the press
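Two of these findings rest on simple aggregation over the classified mentions: counting which names appear alongside the subject, and spotting weeks in which several papers covered her at once. A minimal sketch, assuming each passage has already been reduced to a list of recognized names and each mention to a (newspaper, date) pair; the function names and input shapes are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date


def co_mention_counts(passage_names: list[list[str]], subject: str) -> Counter:
    """Count how often each other name shares a passage with the subject."""
    counts: Counter = Counter()
    for names in passage_names:
        if subject in names:
            counts.update(n for n in set(names) if n != subject)
    return counts


def weeks_with_multiple_papers(mentions: list[tuple[str, date]], min_papers: int = 2):
    """Group mentions by ISO (year, week) and keep weeks covered by several papers."""
    by_week: dict[tuple[int, int], set[str]] = defaultdict(set)
    for paper, day in mentions:
        iso = day.isocalendar()
        by_week[(iso[0], iso[1])].add(paper)
    return {week: papers for week, papers in by_week.items() if len(papers) >= min_papers}
```

Co-mention counts and shared weeks are leads rather than conclusions: they tell the researcher where to look, and the relationship or controversy itself still has to be confirmed by reading the sources.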

What We Learned

This project taught us several principles we now apply to nearly every engagement:

Quality OCR is the foundation for everything: there's no substitute for training on a period-relevant corpus
Natural language beats keyword search: a modern LLM can judge the context of a mention much as a human researcher would, and far faster
Source transparency is critical: every finding must be verifiable back to an original scan
The human researcher is irreplaceable: AI accelerates collection, but historical interpretation remains ours

Is This Right for You?

If you have a research project that involves large quantities of historical press, periodicals or journals, we're here to turn "impossible" into "three weeks of work."

Talk to us about your press research project