A Glossary of AI for Historical Research
A practical reference, in plain language, for the technical vocabulary that surrounds AI-assisted historical research. Each entry is short enough to read and precise enough to quote. If you are a researcher, librarian, archivist, or genealogist, this is the glossary I send to new clients at the start of every project.
The glossary is organised in four clusters: core technologies, methods, risks and controls, and historical resources and sources — because without cultural context, technical words help nobody.
Core Technologies
OCR (Optical Character Recognition)
Conversion of an image of printed text into machine-readable characters. Modern OCR engines combine convolutional neural networks for character classification with language models for context-dependent correction. Works well on clean printed text; struggles with damaged sources, multi-column layouts, and historical typefaces unless the model has been trained on similar material.
HTR (Handwritten Text Recognition)
The cousin of OCR for handwritten material — where letter shapes vary by writer, period, region, and writing instrument. HTR is fundamentally a sequence modelling problem rather than character classification; most modern systems are based on transformer encoders or recurrent networks trained with CTC. Custom training on a few hundred lines of a specific hand can raise accuracy from roughly 60% to 95%+.
LLM (Large Language Model)
A neural network trained on vast amounts of text to predict the next token in a sequence. Modern models (GPT-4, Claude, Gemini, Llama) are capable of summarization, translation, entity extraction, and historical reasoning. In archival work an LLM should rarely operate alone; its answers are most reliable when grounded in sources retrieved through RAG.
RAG (Retrieval-Augmented Generation)
A pattern that combines a retrieval system (search across real documents) with a generative LLM (composing answers). In historical research, RAG is the difference between an AI that hallucinates plausible-sounding biographies and an AI that answers only from sources it can cite. Typical RAG pipeline: query → embedding → vector search → top-k passages → LLM with passages and source metadata → answer with footnotes.
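The pipeline above can be sketched end to end. Everything in this sketch is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the corpus is two invented passages, and the final LLM step is replaced by returning the retrieved passages with their citations — which is exactly the grounding step that separates RAG from free generation.

```python
from collections import Counter
import math

# Toy corpus: each passage carries the source metadata that becomes a footnote.
CORPUS = [
    {"id": "doc-1", "source": "Pinkas Tarnow, p. 14",
     "text": "riots in 1929 damaged the market square"},
    {"id": "doc-2", "source": "Yizkor Book of Tarnow",
     "text": "the community rebuilt the synagogue in 1931"},
]

def embed(text: str) -> Counter:
    """Hypothetical embedding: a bag-of-words stand-in for a real model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def vector_search(query_vec: Counter, k: int = 2) -> list:
    scored = [(cosine(query_vec, embed(d["text"])), d) for d in CORPUS]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]

def answer(query: str) -> str:
    passages = vector_search(embed(query))
    # A real system would hand these passages to an LLM; here we only
    # show the grounding step: every passage keeps its citation.
    lines = [f"{d['text']} [{d['source']}]" for d in passages]
    return "\n".join(lines) if lines else "No sources found."

print(answer("what happened in the 1929 riots"))
```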
Embedding
A numerical vector (usually 768–3072 dimensions) representing the meaning of a passage of text in a way computers can compare. Two texts about the same person — even in different languages — produce nearby vectors. Embeddings turn semantic search ("find documents about the 1929 riots") from a fantasy into a fast database query.
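The comparison itself is simple arithmetic — cosine similarity between two vectors. The vectors below are made-up 4-dimensional toys (real embeddings have hundreds to thousands of dimensions), but the relationship they illustrate is the real one:

```python
import math

def cosine_similarity(a, b):
    """Compare two embedding vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of three passages (values invented for illustration).
riot_report_de = [0.8, 0.1, 0.6, 0.0]   # German passage about the 1929 riots
riot_report_pl = [0.7, 0.2, 0.7, 0.1]   # Polish passage about the same events
tax_ledger     = [0.0, 0.9, 0.1, 0.8]   # unrelated tax record

# The two riot passages sit closer together than either does to the ledger,
# even though they share no words or even a language.
assert cosine_similarity(riot_report_de, riot_report_pl) > \
       cosine_similarity(riot_report_de, tax_ledger)
```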
Vector Database
A purpose-built database optimised for nearest-neighbour search across millions of high-dimensional embeddings. Common engines: Pinecone, Weaviate, Qdrant, pgvector, Chroma. In practice the embedding model sets the relevance ceiling of an archival search system; the vector database determines how fast, and at what scale, that ceiling can be reached.
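At its core, a vector database answers one query: top-k nearest neighbours. A brute-force sketch shows the operation (real engines use approximate indexes such as HNSW to make this fast over millions of vectors; the identifiers and vectors here are invented):

```python
import heapq
import math

def top_k(query, vectors, k=2):
    """Brute-force nearest-neighbour search by cosine similarity.
    `vectors` maps a document id to its embedding."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
    return heapq.nlargest(k, vectors, key=lambda doc_id: cos(query, vectors[doc_id]))

# A toy index of three embedded passages.
index = {
    "pinkas-p14": [0.9, 0.1, 0.3],
    "yizkor-ch2": [0.8, 0.2, 0.4],
    "tax-1887":   [0.1, 0.9, 0.1],
}
print(top_k([0.85, 0.15, 0.35], index, k=2))
```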
Transformer
The neural network architecture introduced in 2017 ("Attention Is All You Need") and powering nearly every modern LLM, OCR system, and HTR system. The core idea is "attention": the model learns which parts of the input matter for each part of the output. Replaced RNN/LSTM architectures for almost all NLP tasks.
NER (Named Entity Recognition)
The task of locating and classifying mentions of people, places, organisations, dates, events, and other categories within text. Used to turn unstructured archival text into structured data: every appearance of "R. Yosel of Tarnov" becomes a node in a knowledge graph, even when the name is spelled five different ways across documents.
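Real NER uses trained sequence models, but a gazetteer lookup is enough to show the shape of the output: character spans with a type label, ready to feed a knowledge graph. Every name and the gazetteer itself are invented for illustration:

```python
import re

# A toy gazetteer; a trained model would generalise beyond a fixed list.
GAZETTEER = {
    "PERSON": ["R. Yosel of Tarnov", "M. Frankelson"],
    "PLACE": ["Galicia", "Argentina"],
}

def tag_entities(text):
    """Return (start, end, label, surface) spans found in `text`."""
    spans = []
    for label, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(re.escape(name), text):
                spans.append((m.start(), m.end(), label, name))
    return sorted(spans)

text = "In 1885 R. Yosel of Tarnov left Galicia."
for span in tag_entities(text):
    print(span)
```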
Knowledge Graph
A network of entities (nodes) and relationships (edges) extracted from documents. A historical knowledge graph might link a person to their place of birth, their community, their occupations, their family members, and the documents in which each fact appears. Supports queries no flat database can: "show me every rabbi who left Galicia for Argentina between 1880 and 1910."
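The query at the end of that entry can be sketched over a minimal graph of (subject, relation, object) triples. All names, relations, and dates below are invented:

```python
# Each fact is a (subject, relation, object) triple, as extracted from documents.
TRIPLES = [
    ("rabbi_yosel",  "occupation",   "rabbi"),
    ("rabbi_yosel",  "left",         "Galicia"),
    ("rabbi_yosel",  "arrived",      "Argentina"),
    ("rabbi_yosel",  "emigrated_in", 1893),
    ("m_frankelson", "occupation",   "merchant"),
    ("m_frankelson", "left",         "Galicia"),
    ("m_frankelson", "arrived",      "Argentina"),
    ("m_frankelson", "emigrated_in", 1902),
]

def facts(subject):
    """All outgoing edges of one node, as a relation -> object dict."""
    return {r: o for s, r, o in TRIPLES if s == subject}

def rabbis_galicia_to_argentina(start, end):
    """Every rabbi who left Galicia for Argentina between `start` and `end`."""
    people = {s for s, _, _ in TRIPLES}
    hits = []
    for p in people:
        f = facts(p)
        if (f.get("occupation") == "rabbi" and f.get("left") == "Galicia"
                and f.get("arrived") == "Argentina"
                and start <= f.get("emigrated_in", 0) <= end):
            hits.append(p)
    return hits

print(rabbis_galicia_to_argentina(1880, 1910))
```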
IIIF (International Image Interoperability Framework)
A standard that lets institutions share high-resolution archival images through interoperable APIs. Allows AI systems to retrieve, annotate, and link images across archives without copying the files themselves. Most major libraries and archives now publish IIIF endpoints.
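The interoperability is concrete: the IIIF Image API encodes region, size, rotation, quality, and format directly in the URL path, so any client can request exactly the crop it needs. The endpoint and identifier below are hypothetical; the path structure follows the Image API:

```python
def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build a IIIF Image API request URL from its five path parameters."""
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

# Hypothetical endpoint and image identifier:
url = iiif_image_url("https://iiif.example-archive.org/iiif/3",
                     "pinkas-tarnow-f014r",
                     region="0,0,2000,1500",   # crop: x,y,w,h in pixels
                     size="!800,600")          # scale to fit inside 800x600
print(url)
```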
Methods
Fine-tuning
Taking a pretrained model and continuing its training on a small, task-specific dataset. In historical research, fine-tuning is how a generic Hebrew OCR engine becomes a specialist in 19th-century rabbinical print. A few thousand transcribed lines are usually enough.
LoRA (Low-Rank Adaptation)
A lightweight fine-tuning technique that updates only a small number of injected parameters, not the entire model. Makes per-archive — or even per-scribe — custom model training economical.
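The arithmetic behind that economy: instead of updating a full d×k weight matrix, LoRA trains two small matrices B (d×r) and A (r×k) at a rank r far below d and k, and adds their product to the frozen weights. The dimensions below are illustrative choices, not a specific model's:

```python
def lora_trainable_params(d, k, r):
    """Parameters trained by LoRA for one d x k weight matrix at rank r:
    B is d x r and A is r x k, so d*r + r*k values in total."""
    return d * r + r * k

d = k = 4096   # an illustrative transformer projection matrix
full = d * k                               # full fine-tuning: every weight
lora = lora_trainable_params(d, k, r=8)    # LoRA at rank 8
print(f"full: {full:,}  lora: {lora:,}  ratio: {full / lora:.0f}x")
```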
Entity Linking
The step after NER: deciding that "M. Frankelson" in Document A and "Mati Frankelson" in Document B and "M. F." in Document C all refer to the same person. The hardest task in any genealogical or biographical AI pipeline. Solved by combining name normalization, contextual features, and explicit human review.
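The name-normalization half of that combination can be sketched as follows. This is a deliberately crude candidate filter, not a full linker: it strips accents and punctuation, then lets a bare initial match any token starting with that letter. All names are invented:

```python
import unicodedata

def normalise(name):
    """Strip accents, lowercase, and drop punctuation; return word tokens."""
    s = unicodedata.normalize("NFKD", name)
    s = "".join(c for c in s if not unicodedata.combining(c))
    return "".join(c if c.isalnum() or c.isspace() else " "
                   for c in s.lower()).split()

def could_match(a, b):
    """True if every token of the shorter name matches a token of the longer,
    counting a bare initial ('m') as matching any token with that first letter.
    A candidate filter only: real linking adds context and human review."""
    short, long_ = sorted((normalise(a), normalise(b)), key=len)
    used = set()
    for t in short:
        hit = next((i for i, u in enumerate(long_) if i not in used and
                    (u == t or (len(t) == 1 and u.startswith(t)))), None)
        if hit is None:
            return False
        used.add(hit)
    return True

assert could_match("M. Frankelson", "Mati Frankelson")
assert could_match("M. F.", "Mati Frankelson")
assert not could_match("M. Frankelson", "Rivka Stein")
```

Note that "M. F." matches far too many people on its own — which is exactly why the document's other signals (dates, places, family members) and explicit human review carry the final decision.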
Disambiguation
Choosing among plausible candidates for an ambiguous mention — for example, distinguishing among three Rabbi Yosels who operated in the same town in the same decade. Disambiguation typically relies on contextual signals: dates, occupations, and family members mentioned nearby.
Ground Truth
A manually verified dataset used to train and evaluate models. In HTR, ground truth is a line-by-line human transcription. In NER, it is manually tagged entities. The quality of any AI system in historical research is bounded by the quality of its ground truth.
Risks and Controls
Hallucination
The tendency of an LLM to generate confident-sounding but factually invented output. The single biggest risk in any AI-assisted historical research project. Mitigated through RAG, source attribution, confidence scoring, structured output formats, and human review of uncertain passages — never fully eliminated, only managed.
Source Attribution
The discipline of tagging every claim an AI produces with the archival document(s) that support it. Non-negotiable in academic work. A research system without source attribution is not a research system; it is a guessing machine.
Confidence Score
A numerical estimate of the reliability of an OCR output, transcription, or extracted entity. Critical for triage: the cheapest way to high overall accuracy is to spend human review effort only on low-confidence outputs.
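The triage itself is a single threshold split. The threshold value and the sample lines below are illustrative, not recommendations:

```python
def triage(items, threshold=0.90):
    """Split extracted items into auto-accepted and human-review queues
    by confidence score. The threshold is a project-specific choice."""
    accepted = [i for i in items if i["confidence"] >= threshold]
    review = [i for i in items if i["confidence"] < threshold]
    return accepted, review

# Invented OCR output lines with per-line confidence scores.
lines = [
    {"text": "born 12 March 1874", "confidence": 0.97},
    {"text": "d?ed in Tarn?v",     "confidence": 0.41},
    {"text": "merchant by trade",  "confidence": 0.93},
]
accepted, review = triage(lines)
print(f"{len(accepted)} auto-accepted, {len(review)} for human review")
```

Human effort then goes only to the review queue — the cheapest route to high overall accuracy the entry describes.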
Common Crawl
A massive, publicly available corpus of crawled web pages on which most large language models are trained. If your site appears in Common Crawl, a model trained after that crawl date may "know" you. This is one of the few mechanisms by which a website becomes "known" to AI systems.
Historical Resources and Sources
Pinkas Kehila
A community register kept by an autonomous Jewish community (kehila), recording births, marriages, deaths, taxes, communal decisions, charity, and disputes. A primary source for reconstructing Jewish life from the 16th through the 20th century. Usually written in mixed Hebrew-Yiddish and the local vernacular, in a difficult rabbinic hand. A primary target for HTR systems.
Yizkor Book
A memorial volume composed by Holocaust survivors for a destroyed Jewish community — including the names of the perished, the geography of the lost shtetl, biographical sketches, and historical narrative. Tens of thousands have been published; many remain only partially indexed. AI is now making them globally searchable for the first time.
Ketubah
A Jewish marriage contract, sometimes ornately illuminated, recording the spouses, their families, the date, and the husband's financial obligations. A foundational source for genealogy and the social history of Jewish communities. Decoding ketubot at scale requires combining HTR with visual analysis.
Yad Vashem
The World Holocaust Remembrance Center in Jerusalem, holding the world's largest archive of Holocaust documentation and victim records. The "Pages of Testimony" database alone contains more than 4.8 million records. AI is dramatically accelerating work across this collection.
Arolsen Archives
The International Center on Nazi Persecution, located in Arolsen, Germany. Holds approximately 30 million documents on victims of the Nazis, including transport lists, camp records, and post-war survivor documentation. A central target for cross-archive AI research.
JewishGen
A comprehensive resource for Jewish family history, hosting databases of records, immigrant lists, town histories, and discussion groups across most regions of the diaspora. AI agents that crawl JewishGen alongside other archives unlock cross-source family discoveries.
Kurrent / Sütterlin
German handwritten scripts used widely until World War II — Kurrent from the 16th century onwards, Sütterlin as a 20th-century simplification. Nearly unreadable to modern German speakers without training. A distinct HTR target with a finite, well-defined training corpus.
How to Use This Glossary
This glossary is built for three uses:
First, to build shared vocabulary with clients, researchers, and institutions I work with. Conversations about AI in historical research often fail because the parties mean different things by "OCR" or "knowledge graph." Shared definitions close that gap before it costs anyone time.
Second, as a source document that other articles, proposals, and meeting notes can reference without repeating explanations in every document.
Third, as a starting point for inquiry — if a term is missing or an explanation seems misleading, write to maaty@mf-sr.com. The glossary grows with reader questions.
This glossary will be updated periodically. Terms scheduled for the next revision include: speculative decoding, multimodal embeddings, OCR-on-OCR (cross-engine validation), agentic research workflows, and temporal entity resolution. If any of those is relevant to your project — I would be glad to talk.
