Glossary — AI for Historical Research

A practical, plain-language reference for the technical vocabulary used across MF Smart Research projects. Each entry is short enough to be readable, precise enough to be cited.


OCR (Optical Character Recognition)

The conversion of an image of printed text into machine-readable characters. Modern OCR engines combine convolutional neural networks for character classification with language models for context-aware correction. Works well on clean printed text; struggles on degraded paper, mixed columns and historical typefaces unless the model is fine-tuned on similar material.

HTR (Handwritten Text Recognition)

The cousin of OCR designed for handwriting, where letterforms vary by author, period, region and writing instrument. HTR is fundamentally a sequence-modeling problem rather than a character-classification one: most modern systems use transformer encoders or CTC-trained recurrent networks. In favorable cases, custom training on a few hundred lines of a specific hand can lift accuracy from below 60% to above 95%.

LLM (Large Language Model)

A neural network trained on vast amounts of text to predict the next token in a sequence. Modern LLMs (GPT-4, Claude, Gemini, Llama) can summarize, translate, extract entities and reason about historical context, abilities that emerge from training at scale. In archival work an LLM is rarely used alone: it is usually grounded in retrieved sources via RAG.

RAG (Retrieval-Augmented Generation)

A pattern that combines a retrieval system (searching real documents) with a generative LLM (composing answers). For historical research, RAG is the difference between an AI that hallucinates plausible-sounding biographies and an AI that answers only from sources it can cite. A typical RAG pipeline: query → embedding → vector search → top-k passages → LLM with passages and source metadata → answer with footnotes.
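
The pipeline above can be sketched in a few lines. This is only an illustration under loud assumptions: the three passages are invented, `embed` is a toy word-count stand-in for a real embedding model, and the final LLM call is left out, so the sketch stops at the prompt that would be sent to the model.

```python
import math
from collections import Counter

# Hypothetical mini-archive: (source id, passage text). All data is invented.
PASSAGES = [
    ("doc-001", "Yossel the tailor was born in Tarnow in 1861"),
    ("doc-002", "The synagogue in Lemberg was rebuilt after the fire"),
    ("doc-003", "Yossel of Tarnow emigrated to Buenos Aires in 1905"),
]

def embed(text):
    """Toy embedding: lowercase word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=2):
    """Rank every passage against the query and keep the top k."""
    q = embed(query)
    ranked = sorted(PASSAGES, key=lambda p: cosine(q, embed(p[1])), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Put the passages and their source ids in front of the question."""
    sources = "\n".join(f"[{sid}] {text}" for sid, text in passages)
    return f"Answer only from the sources below and cite their ids.\n{sources}\nQuestion: {query}"

question = "Where did Yossel of Tarnow emigrate?"
top = retrieve(question)
prompt = build_prompt(question, top)
```

Because the source ids travel with the passages into the prompt, the model has what it needs to produce footnotes rather than unsupported claims.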

Embedding

A numeric vector (often 768 to 3072 dimensions) representing the meaning of a piece of text in a way computers can compare. Two texts about the same person — even in different languages — produce nearby vectors. Embeddings turn semantic search ("find documents about the 1929 riots") from a fantasy into a fast database query.
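
As a rough illustration of "nearby vectors", the sketch below compares made-up 4-dimensional vectors with cosine similarity. The vectors and variable names are assumptions for the example; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1.0 means very similar."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up vectors standing in for real model output.
riots_en = [0.9, 0.1, 0.3, 0.0]   # English text about the 1929 riots
riots_he = [0.8, 0.2, 0.4, 0.1]   # Hebrew text on the same topic
wedding = [0.0, 0.9, 0.1, 0.7]    # unrelated text about a wedding

sim_same_topic = cosine_similarity(riots_en, riots_he)
sim_other_topic = cosine_similarity(riots_en, wedding)
```

With these values the same-topic pair scores far higher than the unrelated pair, which is the property semantic search relies on.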

Vector Database

A specialized database optimized for nearest-neighbor search across millions of high-dimensional embeddings. Common options: Pinecone, Weaviate, Qdrant, Chroma and pgvector (a PostgreSQL extension). In practice, the speed and recall of an archive search system often depend on how well the database's index suits the dimensionality and distance metric of the embedding model in use.
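
The core operation of a vector database can be sketched as an exact, brute-force top-k search. The index contents below are invented, and the sketch does not use any particular engine's API. Production engines typically replace this linear scan with an approximate index to stay fast at scale.

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical index: document id -> unit-length embedding (made-up values).
INDEX = {
    "letter-17": [1.0, 0.0, 0.0],
    "ledger-03": [0.0, 1.0, 0.0],
    "memoir-42": [0.6, 0.8, 0.0],
    "photo-08": [0.0, 0.0, 1.0],
}

def nearest(query, k=2):
    """Exact top-k by dot product, which equals cosine similarity for unit vectors."""
    return heapq.nlargest(k, INDEX, key=lambda doc_id: dot(query, INDEX[doc_id]))

hits = nearest([0.8, 0.6, 0.0])
```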

NER (Named Entity Recognition)

The task of finding and classifying mentions of people, places, organizations, dates, events and other categories inside text. Used to turn unstructured archival text into structured data: every appearance of "Reb Yossel of Tarnow" becomes a node in a knowledge graph, even if his name is spelled five different ways across documents.
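
To show the shape of NER output, here is a deliberately simple rule-based tagger. It is a toy under stated assumptions: the `PLACES` gazetteer and the year pattern are invented for the example, and a real pipeline would normally use a trained model rather than hand-written rules.

```python
import re

# Hypothetical gazetteer of known place names.
PLACES = {"Tarnow", "Lemberg", "Krakow"}
# Four-digit years from 1500 to 1999.
YEAR = re.compile(r"\b(1[5-9]\d\d)\b")

def tag_entities(text):
    """Return (label, surface form) pairs found in the text."""
    entities = [("DATE", m.group(1)) for m in YEAR.finditer(text)]
    for word in re.findall(r"\w+", text):
        if word in PLACES:
            entities.append(("PLACE", word))
    return entities

found = tag_entities("Reb Yossel left Tarnow for Lemberg in 1887.")
```

Each tagged pair is what later becomes a candidate node in a knowledge graph.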

Knowledge Graph

A network of entities (nodes) and relationships (edges) extracted from documents. A historical knowledge graph might link a person to their birthplace, congregation, professions, family members and the documents in which each fact appears. Supports queries that no flat database can answer: "show me all rabbis who left Galicia for Argentina between 1880 and 1910."
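
A knowledge graph can be approximated for illustration as a list of facts, each carrying the document it came from. The people, relation names and source ids below are invented. The query mirrors the example in the definition above.

```python
# Each fact is (subject, relation, object, source document). All data is invented.
TRIPLES = [
    ("rabbi_a", "occupation", "rabbi", "doc-11"),
    ("rabbi_a", "left", "Galicia", "doc-11"),
    ("rabbi_a", "arrived", "Argentina", "doc-12"),
    ("rabbi_a", "emigrated_in", 1894, "doc-12"),
    ("tailor_b", "occupation", "tailor", "doc-20"),
    ("tailor_b", "left", "Galicia", "doc-20"),
    ("tailor_b", "arrived", "Argentina", "doc-21"),
    ("tailor_b", "emigrated_in", 1901, "doc-21"),
    ("rabbi_c", "occupation", "rabbi", "doc-30"),
    ("rabbi_c", "left", "Galicia", "doc-30"),
    ("rabbi_c", "arrived", "Argentina", "doc-31"),
    ("rabbi_c", "emigrated_in", 1925, "doc-31"),
]

def has(subject, relation, value):
    """True if the graph contains this exact fact."""
    return any(s == subject and r == relation and o == value for s, r, o, _ in TRIPLES)

def emigrated_between(subject, start, end):
    """True if the subject has an emigration year inside [start, end]."""
    return any(s == subject and r == "emigrated_in" and start <= o <= end
               for s, r, o, _ in TRIPLES)

def rabbis_galicia_to_argentina(start, end):
    people = {s for s, _, _, _ in TRIPLES}
    return sorted(p for p in people
                  if has(p, "occupation", "rabbi")
                  and has(p, "left", "Galicia")
                  and has(p, "arrived", "Argentina")
                  and emigrated_between(p, start, end))

result = rabbis_galicia_to_argentina(1880, 1910)
```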

Entity Linking

The step after NER: deciding that "M. Frankelson" in document A and "Maaty Frankelson" in document B and "מתי פ." in document C all refer to the same person. The hardest task in any genealogical or biographical AI pipeline. Solvable through a combination of name normalization, contextual features and explicit human review.
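
The name-normalization part can be sketched as a candidate filter that flags pairs worth sending to human review. The matching rule here (identical surname, compatible first name or initial) is an assumption chosen for illustration. It does not handle transliteration between scripts, so the Hebrew form in the example above would need a separate step.

```python
import re

def normalize(name):
    """Lowercase, drop punctuation, split into tokens."""
    return re.sub(r"[^\w\s]", "", name.lower()).split()

def could_be_same_person(name_a, name_b):
    """Cheap candidate check: same surname and compatible first name or initial."""
    a, b = normalize(name_a), normalize(name_b)
    if not a or not b or a[-1] != b[-1]:
        return False  # surnames must match exactly
    first_a, first_b = a[0], b[0]
    # An initial ("m") is compatible with any given name starting with that letter.
    return first_a.startswith(first_b) or first_b.startswith(first_a)

match = could_be_same_person("M. Frankelson", "Maaty Frankelson")
mismatch = could_be_same_person("S. Frankelson", "Maaty Frankelson")
```

A positive result only means the pair is plausible. Contextual features and human review still decide whether the two mentions are really one person.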

Disambiguation

Choosing between candidate referents for an ambiguous mention — for example, distinguishing between three different Reb Yossels active in the same town in the same decade. Disambiguation usually relies on contextual signals: dates, occupations, family members mentioned nearby.
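
One simple way to use contextual signals is to count how many of them each candidate matches. The candidate records and field names below are invented, and equal weighting of signals is an assumption. A real system would likely weight signals and send near-ties to human review.

```python
# Hypothetical candidate records for three men named Yossel in one town.
CANDIDATES = {
    "yossel_1": {"occupation": "tailor", "spouse": "Rivka", "decade": 1880},
    "yossel_2": {"occupation": "innkeeper", "spouse": "Chana", "decade": 1880},
    "yossel_3": {"occupation": "tailor", "spouse": "Chana", "decade": 1890},
}

def disambiguate(context):
    """Score each candidate by the number of context signals it matches."""
    scores = {
        cid: sum(1 for key, value in context.items() if record.get(key) == value)
        for cid, record in CANDIDATES.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Signals extracted from the text around the ambiguous mention.
best, scores = disambiguate({"occupation": "tailor", "spouse": "Rivka", "decade": 1880})
```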

Transformer

The neural-network architecture introduced in 2017 ("Attention Is All You Need") that powers nearly every modern LLM and many current OCR and HTR systems. The key idea is "attention": the model learns which parts of the input matter for each part of the output. It has replaced older RNN/LSTM architectures for almost all NLP tasks.
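
The attention idea can be made concrete with scaled dot-product attention, sketched here in plain Python for readability. The tiny key, value and query vectors are made up. Real models apply the same computation to large learned matrices, across many heads and layers.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # How strongly this query matches each key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = attention([[5.0, 0.0]], keys, values)
```

The query points in the direction of the first key, so nearly all of the weight lands on the first value vector.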

Fine-tuning

Taking a pre-trained model and training it further on a smaller, task-specific dataset. For historical research, fine-tuning is how a generic Hebrew OCR engine becomes an expert in 19th-century rabbinical printing. A few thousand transcribed lines is often enough.

LoRA (Low-Rank Adaptation)

A lightweight fine-tuning technique that updates only a small number of injected parameters rather than the whole model. Makes it economically viable to train custom models for individual archives or even individual scribes' hands.
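
The saving is easy to see with arithmetic. For one weight matrix of shape d by k, full fine-tuning updates d * k values, while LoRA trains two small matrices of shapes d by r and r by k. The sizes below are illustrative assumptions, not taken from any specific model.

```python
def full_params(d, k):
    """Trainable values when the whole d x k matrix is updated."""
    return d * k

def lora_params(d, k, r):
    """Trainable values for the two low-rank factors (d x r and r x k)."""
    return r * (d + k)

d, k, r = 4096, 4096, 8        # illustrative layer size and rank
full = full_params(d, k)       # 16777216
lora = lora_params(d, k, r)    # 65536
ratio = lora / full            # 0.00390625, i.e. about 0.4% of the full count
```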

Hallucination

An LLM's tendency to produce confident-sounding but factually invented output. The single largest risk in any AI-assisted historical research project. Mitigated through RAG, source attribution, confidence scoring, structured output formats and human review on uncertain passages — never eliminated, only managed.

Source Attribution

The discipline of tagging every AI-generated claim with the specific archival document(s) that support it. Non-negotiable in scholarly work. A research system without source attribution is not a research system; it is a guessing machine.

Confidence Score

A numerical estimate of how reliable a particular OCR output, transcription, or extracted entity is. Critical for triage: the cheapest path to high overall accuracy is to spend human review only on low-confidence outputs.
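
Triage by confidence can be sketched as a single threshold split. The 0.90 threshold and the sample lines are assumptions for the example. In practice the threshold would be set by checking accepted outputs against ground truth.

```python
def triage(items, threshold=0.90):
    """Split outputs into auto-accepted and needs-human-review by confidence."""
    auto_accept = [i for i in items if i["confidence"] >= threshold]
    needs_review = [i for i in items if i["confidence"] < threshold]
    return auto_accept, needs_review

# Invented transcription lines with model-reported confidence.
lines = [
    {"id": "p1-l1", "text": "born in Tarnow", "confidence": 0.98},
    {"id": "p1-l2", "text": "son of Y?ssel", "confidence": 0.62},
    {"id": "p1-l3", "text": "died 1887", "confidence": 0.91},
]

accepted, review_queue = triage(lines)
```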

Ground Truth

A manually verified reference dataset used to train and evaluate models. For HTR, ground truth is line-by-line human transcription. For NER, it is human-tagged entities. The quality of any AI system in historical research is bounded by the quality of its ground truth.
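
On the evaluation side, a common way to compare a model's transcription with ground truth is the character error rate (CER): edit distance divided by the length of the reference. The sketch below assumes a non-empty reference, and the sample strings are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate against a non-empty ground-truth reference."""
    return edit_distance(reference, hypothesis) / len(reference)

# Two wrong characters out of eleven.
score = cer("Tarnow 1887", "Tarnov 1837")
```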

Common Crawl

A massive open dataset of crawled web pages that forms part of the training data for many large language models. If your website appears in Common Crawl, an LLM trained after that crawl date may have learned about you. This is one of the few mechanisms by which a website becomes "known" to AI systems.

Pinkas Kehila (פנקס קהילה)

A community ledger maintained by a Jewish self-governing community (kehila), recording births, marriages, deaths, taxes, communal decisions, charity and disputes. A primary source for reconstructing Jewish life from the 16th to the 20th century. Often written in mixed Hebrew, Yiddish and the local vernacular, in difficult rabbinical hands. A standing target for HTR systems.

Yizkor Book (ספר יזכור)

A memorial book composed after the Holocaust by survivors of a destroyed Jewish community, recording the names of victims, the geography of the lost town, biographical sketches and historical narrative. Well over a thousand have been published; many remain only partially indexed. AI is now making them globally searchable for the first time.

Ketuba (כתובה)

A Jewish marriage contract, often beautifully illuminated, recording the spouses, their families, the date and the financial commitments of the husband. A core source for genealogy and for the social history of Jewish communities. Decoding ketubot at scale requires both HTR and visual analysis.

Yad Vashem

The World Holocaust Remembrance Center in Jerusalem, holding the largest archive of Holocaust documentation and victim records globally. Its Central Database of Shoah Victims' Names, built in large part from Pages of Testimony, contains over 4.8 million names. AI-assisted research dramatically accelerates work across this collection.

Arolsen Archives

The International Center on Nazi Persecution in Bad Arolsen, Germany. Holds approximately 30 million documents on victims of Nazi persecution, including transport lists, camp records and post-war displaced-persons documentation. A primary target for cross-archive AI research.

JewishGen

A comprehensive online resource for Jewish family history, hosting databases of vital records, immigration records, town histories and discussion groups across most regions of the diaspora. AI agents that traverse JewishGen alongside other archives unlock cross-source family discoveries.

Kurrent / Sütterlin

German cursive scripts in widespread use until World War II — Kurrent from the 16th century onward, Sütterlin as a 20th-century simplification. Almost unreadable to modern German speakers without training. A specialized HTR target with a clear, finite training set.

IIIF (International Image Interoperability Framework)

A standard that allows institutions to share high-resolution archival images via interoperable APIs. Enables AI systems to fetch, annotate and link images across archives without copying the underlying assets. Most major libraries and archives now publish IIIF endpoints.
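
For a sense of what "fetching via interoperable APIs" looks like, the IIIF Image API addresses an image with a URL of the form base/identifier/region/size/rotation/quality.format. The sketch below builds such URLs. The server address and identifier are hypothetical, and the `max` size keyword assumes a server that supports it.

```python
from urllib.parse import quote

def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL; the identifier is percent-encoded."""
    return (f"{base.rstrip('/')}/{quote(identifier, safe='')}"
            f"/{region}/{size}/{rotation}/{quality}.{fmt}")

# Hypothetical endpoint and identifier.
full_page = iiif_image_url("https://iiif.example.org/image", "pinkas/page-001")
# Region given as x,y,width,height in pixels: crop the top strip of the page.
header_crop = iiif_image_url("https://iiif.example.org/image", "pinkas/page-001",
                             region="0,0,1024,512")
```

Because the region is part of the URL, an HTR system can request just the line or paragraph it needs instead of downloading the whole page.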


Have a term you want added? Email [email protected] — this glossary grows with reader questions.