For additional details or contributions, please visit https://github.com/cneud/ocr-gt.
- Archiscribe
4255 lines from 112 19th Century German prints published across 73 years
https://github.com/jbaiter/archiscribe-corpus
- CIS-OCR
PoCoTo example documents with ground truth
https://github.com/cisocrgroup/Resources/tree/master/ocrtestset
- CLTK
Corpora from Classical Language Toolkit
https://github.com/cltk
- DIVA-HisDB
DIVA-HisDB collection of three medieval manuscripts (CSG18, CSG863, CB55)
https://diuf.unifr.ch/main/hisdoc/diva-hisdb
- EarlyPrintedBooks
8,800 lines from several early printed books
https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks
- EEBO-TCP
Early English Books Online (EEBO) documents transcribed by TCP
https://github.com/Anterotesis/historical-texts/tree/master/eebo-tcp
- ECCO-TCP
Eighteenth Century Collections Online (ECCO) documents transcribed by TCP
https://github.com/Anterotesis/historical-texts/tree/master/eebo-tcp
- eMOP-TCP
ECCO texts, cleaned up by eMOP
https://github.com/Early-Modern-OCR/TCP-ECCO-texts
- Evans-TCP
Evans Early American Imprints documents transcribed by TCP
https://github.com/Anterotesis/historical-texts/tree/master/evans-tcp
- FDHN
Finnish Digitised Historical Newspapers
https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en
- GERMANA
The GERMANA corpus “Doña Germana de Foix” (1891)
https://www.prhlt.upv.es/wp/resource/the-germana-corpus
- GT4HistOCR
Ground Truth for German Fraktur and Early Modern Latin
https://doi.org/10.5281/zenodo.1344132
- imagessan
Sanskrit images & ground truth (Devanagari script)
https://github.com/Shreeshrii/imagessan/
- RODRIGO
The RODRIGO corpus “Don Rodrigo” (1545)
https://zenodo.org/record/1490009
- ENP
Historic newspapers from Europeana Newspapers
http://www.primaresearch.org/datasets/ENP
- old-books
Old books GT from Project Gutenberg ebooks
https://github.com/PedroBarcha/old-books-dataset
- OCR_GS_Data
Double-checked Arabic Gold Standard from OpenITI
https://github.com/OpenITI/OCR_GS_Data
- OCR-D
Ground truth data from OCR-D
https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit
- OCR19thSAC
Text+Berg digital Swiss Alpine Club yearbooks
https://files.ifi.uzh.ch/cl/OCR19thSAC/
- MJSynth
9m synthetic images covering 90k words
http://www.robots.ox.ac.uk/~vgg/data/text/
- CDIP
IIT Complex Document Information Processing Dataset
https://data.nist.gov/od/id/mds2-2531
- IMPACT-BHL
IMPACT: Biodiversity Heritage Library
https://github.com/impactcentre/groundtruth-bhl
- IMPACT-BL
IMPACT: The British Library
https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/impact-dataset-browser/?query=&search-filter-institution=BL
- IMPACT-BNE
IMPACT: National Library of Spain
https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/impact-dataset-browser/?query=&search-filter-institution=BNE
- IMPACT-BNF
IMPACT: National Library of France
https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/impact-dataset-browser/?query=&search-filter-institution=BNF
- IMPACT-BSB
IMPACT: Bavarian State Library
https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/impact-dataset-browser/?query=&search-filter-institution=BSB
- IMPACT-KB
IMPACT: National Library of the Netherlands
http://lab.kb.nl/dataset/ground-truth-impact-project#access
- IMPACT-NKC
IMPACT: Czech National Library
https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/impact-dataset-browser/?query=&search-filter-institution=NKC
- IMPACT-NLB
IMPACT: National Library of Bulgaria
https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/impact-dataset-browser/?query=&search-filter-institution=NLB
- IMPACT-NUK
IMPACT: National Library of Slovenia
https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/impact-dataset-browser/?query=&search-filter-institution=NUK
- IMPACT-ONB
IMPACT: Austrian National Library
https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/impact-dataset-browser/?query=&search-filter-institution=ONB
- IMPACT-PSNC
IMPACT: Polish ground truth from four digital libraries
http://dl.psnc.pl/activities/projekty/impact/results/
- HORAE
HORAE: an annotated dataset of books of hours
https://github.com/oriflamms/HORAE
- ONB_NewsEye
READ OCR training dataset from Austrian Newspapers
https://zenodo.org/record/3387369
- READ_Konzil
8770 transcribed lines of 18th century handwritten German
https://zenodo.org/record/215383
- READ_Bozen
Handwritten Early Modern German Ratsprotokolle
https://zenodo.org/record/218236
- NZZ_impresso
Neue Zürcher Zeitung black letter period
https://github.com/impresso/NZZ-black-letter-ground-truth
- MiBio
Mining Biodiversity OCR dataset
https://github.com/jie-mei/MiBio-OCR-dataset
- RASM2019
BL/Qatar Foundation Arabic Handwritten Ground Truth
https://bl.iro.bl.uk/work/f866aefa-b025-4675-b37d-44647649ba71
- jze_fraktur
Fraktur model for OCRopus training data
https://github.com/jze/ocropus-model_fraktur
- Fibeln
Primers (19th century)
https://github.com/UB-Mannheim/Fibeln
- Weisthuemer
Jacob Grimm Weisthuemer
https://github.com/UB-Mannheim/Weisthuemer
- Wollmers_deu
German language (nature, biology) ground truth
https://github.com/wollmers/ocr-deu-bio-testfiles
- Wollmers_lat
Latin language (nature, biology) ground truth
https://github.com/wollmers/ocr-lat-bio-testfiles
- Wollmers_eng
English language (nature, biology) ground truth
https://github.com/wollmers/ocr-eng-bio-testfiles
- AustrianNewspapers
Enhancement of ONB_NewsEye by UB Mannheim
https://github.com/UB-Mannheim/AustrianNewspapers
- GBN
German-Brazilian historical newspapers
https://web.inf.ufpr.br/vri/databases/gbn/
- Salamanca
School of Salamanca works
https://www.salamanca.school/en/works.html
- ULB_Halle_HP1
ULB Sachsen-Anhalt newspapers
https://github.com/ulb-sachsen-anhalt/ulb-zeitungsprojekt-hp1
- TIMEUS
French 18th/19th-century HTR corpus by the ANR project TIME-US
https://github.com/HTR-United/timeuscorpus
- TAPUS
French typewritten Ground Truth
https://github.com/HTR-United/tapuscorpus
- DAHN
French typewritten Ground Truth
https://github.com/HTR-United/dahncorpus
- DDI-100
Distorted Document Images dataset
https://github.com/machine-intelligence-laboratory/DDI-100
- sGMB
Synthetic handwritten Groningen Meaning Bank dataset
https://github.com/omni-us/research-dataset-sGMB
- ocr-data
Historical prints from around 1830
https://zenodo.org/record/4742068
- Total-Text
Horizontal, Multi-Oriented, and Curved Text
https://github.com/cs-chan/Total-Text-Dataset
- DocBank
Benchmark Dataset for Document Layout Analysis
https://doc-analysis.github.io/docbank-page/index.html
- B-MOD
Brno Mobile OCR Dataset
https://pero.fit.vutbr.cz/brno_mobile_ocr_dataset
- VOC_HTR
Dutch East India Company (VOC) documents
https://zenodo.org/record/4638495
- IAM
IAM Handwriting Database
https://fki.tic.heia-fr.ch/databases/iam-handwriting-database
- HJDataset
Historical Japanese Documents with Complex Layouts
https://dell-research-harvard.github.io/HJDataset/
- PRImA Layout Analysis Dataset
A realistic contemporary document dataset
https://www.primaresearch.org/datasets/Layout_Analysis
- ocr-greek_cursive
Greek cursive ground truth for Kraken and for Calamari
https://github.com/pharos-alexandria/ocr-greek_cursive
- BIR-database
Database for typewritten emphasis (bold, italic, regular)
https://github.com/asciusb/BIR-database
- indiscapes
Intelligent Historical Document Image Analytics initiative dataset
http://ihdia.iiit.ac.in/indiscapes/
- OCRcat
OCRcat training data
https://github.com/katabase/OCRcat
- NZZ_impresso (UB Mannheim)
Fork of 00000036 incl. GT fixes by UB Mannheim
https://github.com/UB-Mannheim/NZZ-black-letter-ground-truth/tree/gt-fixes
- NewsEye / READ OCR training dataset
Swedish newspaper pages from late 18th-20th century with corrected text
https://zenodo.org/record/4599624
- OCR17+ - Layout analysis and text recognition for 17th c. French prints
The repo contains training data and models for layout analysis and text recognition for 17th c. French prints
https://github.com/Heresta/OCR17plus
- CarolineMinuscule
Caroline Minuscule training pool of around 70 manuscripts
https://github.com/rescribe/carolineminuscule-groundtruth
- Fraktur UB Tübingen
Fraktur/Gothic prints from the 19th Century
https://github.com/ubtue/gt-fraktur
- ULB Sachsen-Anhalt newspapers
Training data of the DFG-Project "Zeitungsdigitalisierung Hauptphase I"
https://github.com/ulb-sachsen-anhalt/ulb-zeitungsprojekt-hp1