Digital Archiving in the AI Era
In the dusty shelves of forgotten libraries, history waits to be rediscovered. For decades, "digitization" meant taking a picture — preserving the form but not the content. A scanned page is not the same as a page that can be searched, cross-referenced, or asked questions of. The leap from one to the other is what separates the museum-grade scans of the 2000s from the AI-native archives we are building now.
Beyond simple digitization
The first generation of digital archives solved a real problem: physical documents decay, get lost, and sit behind locked doors no researcher can travel to. Scanning everything at high resolution and putting the images online was an enormous public good. Most of the major Jewish, Russian, and European archives carried out this work between roughly 2005 and 2018.
But it stopped there. A scan is a replica, not a resource. You can look at it; you cannot ask it anything. If you wanted to find every reference to a specific shtetl across forty rabbinical responsa volumes, you still had to read them yourself — exactly as your grandfather did with the paper originals.
From pixels to meaning
What changes in the AI era is the entire stack underneath the scan.
Modern OCR has caught up with print. HTR (handwritten text recognition) — fundamentally a different problem — has caught up with cursive. Together, they can now produce searchable text from a 19th-century Yiddish ledger or a Soviet-era typed letter with accuracy that was unthinkable a decade ago.
But text is only the second layer. The third — and the one that distinguishes a true AI-native archive — is meaning. When a system knows that "יעקב פרלמוטר" on page 47 of one ledger and "Yankel Perelmuter" on page 12 of an immigration manifest are the same person, the archive becomes something new: a graph of named entities, dates, places, and relationships, woven across documents that may have been physically separated for a century.
The three layers
A complete AI-native archive operates at three levels:
- The pixel layer — the high-resolution scan, treated as the canonical object of preservation. This is what you point at when someone asks "is this real?"
- The text layer — OCR or HTR output, plus structured transcription. This is what makes the archive searchable.
- The knowledge layer — entities (people, places, organizations, events), the relationships between them, and a provenance trail back to the exact pixel coordinates where each fact was extracted.
Each layer requires different tools and different judgment. Pixel preservation is a metadata and storage problem. Text extraction is a machine learning problem. The knowledge layer is where archival scholarship and AI engineering have to genuinely collaborate — because the question of whether two name variants are "the same person" is rarely something a model can answer alone.
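The knowledge layer's provenance requirement can be made concrete with a small data structure: every extracted fact carries a pointer back through the text layer to pixel coordinates on the canonical scan. The field names and record below are an illustrative sketch, not a standard archival schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Provenance:
    """Where on which scan a fact was read."""
    scan_id: str                      # identifier of the pixel-layer image
    page: int
    bbox: tuple[int, int, int, int]   # pixel coordinates: x0, y0, x1, y1

@dataclass(frozen=True)
class ExtractedFact:
    """A single assertion in the knowledge layer."""
    subject: str        # canonical entity ID
    predicate: str      # relationship type, e.g. "appears_in"
    obj: str            # canonical entity ID or literal value
    provenance: Provenance

# Hypothetical example: a person mentioned in a ledger, traceable
# back to the exact region of the scan the name was extracted from.
fact = ExtractedFact(
    subject="perelmuter_yankel",
    predicate="appears_in",
    obj="ledger_1887",
    provenance=Provenance(scan_id="ledger_1887_p47", page=47,
                          bbox=(120, 340, 410, 372)),
)
```

Making facts immutable and provenance mandatory is the design choice that lets anyone ask "is this real?" of the knowledge layer and be pointed back to the pixel layer for the answer.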
A concrete example
Consider a community memorial book — a yizkor buch — produced in the 1950s by survivors of a destroyed Eastern European Jewish community. It contains hundreds of names, often spelled inconsistently, in a mixture of Hebrew, Yiddish, and the local language. A traditional digitization gives you a beautiful PDF. An AI-native digitization gives you that PDF plus:
- Every name extracted, normalized, and linked to its variants
- Every place name connected to a modern coordinate, where possible
- Cross-references to other archives — Pages of Testimony at Yad Vashem, JewishGen databases, town records held in Polish or Lithuanian state archives — wherever those people or places appear elsewhere
The result is not just a searchable book. It is a node in a network that lets a great-grandchild searching for a single ancestor's name surface, in seconds, every connected document that mentions that person.
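The cross-archive lookup described above reduces to an inverted index from canonical entity IDs to the documents that mention them. A minimal sketch, with entirely hypothetical records standing in for real archive holdings:

```python
from collections import defaultdict

# Illustrative records: (archive, document, canonical entity IDs mentioned).
# Invented for the example; real holdings would come from each archive's API.
RECORDS = [
    ("yizkor_book", "p. 112", {"perelmuter_yankel", "town_x"}),
    ("immigration_manifest", "SS Rotterdam 1907", {"perelmuter_yankel"}),
    ("state_archive", "birth register 1885", {"perelmuter_yankel", "town_x"}),
]

def build_index(records):
    """Map each entity ID to every (archive, document) that mentions it."""
    index = defaultdict(list)
    for archive, doc, entities in records:
        for entity in entities:
            index[entity].append((archive, doc))
    return index

def documents_mentioning(index, entity_id):
    """All documents, across archives, linked to one canonical entity."""
    return index.get(entity_id, [])
```

One resolved name, one query, every linked record: that is the mechanical difference between a PDF and a node in a network.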
Why this matters now
The window for getting this right is narrower than people think. The handwritten and typed records of the 19th and early 20th centuries are not getting easier to read; the people who can still decipher faded Cyrillic cursive or rabbinical shorthand are aging out. Building the AI infrastructure now — while there is still expert knowledge to fine-tune models on — is the work of this decade.
"The past is not dead. It's not even past." — William Faulkner
We are making sure it stays accessible.
