
HIPE-OCRepair 2026: ICDAR Competition on LLM-Assisted OCR Post-Correction

Link: https://hipe-eval.github.io/HIPE-OCRepair-2026/
 
When Sep 2, 2026 - Sep 4, 2026
Where at ICDAR 2026
Submission Deadline TBD
 

Call For Papers

(apologies for cross-postings)

====

HIPE-OCRepair 2026 - Historical OCR Post-Correction Shared Task

Website: https://hipe-eval.github.io/HIPE-OCRepair-2026/

Task: LLM-Assisted OCR Post-Correction for Multilingual Historical Documents

Venue: ICDAR 2026 (31 Aug - 4 Sep 2026)

====

Data: https://github.com/hipe-eval/HIPE-OCRepair-2026-data

Participation Guidelines: https://github.com/hipe-eval/HIPE-OCRepair-2026-data/blob/main/README-Participation-Guidelines.md

Scorer: https://github.com/hipe-eval/HIPE-OCRepair-scorer/

====

We invite participation in HIPE-OCRepair 2026, the ICDAR 2026 Competition on LLM-Assisted OCR Post-Correction for Historical Documents.

Large-scale digitized historical collections still contain substantial OCR errors. Re-processing millions of pages with improved engines is rarely feasible, making post-correction the most viable strategy for addressing the OCR debt accumulated in digital heritage collections. Recent progress in large language models opens promising new directions, but their effectiveness varies across languages and error types, and they may introduce hallucinations or unintended alterations.

To what extent can modern large language models address the OCR debt accumulated in large-scale digitized historical collections?

HIPE-OCRepair 2026 addresses this question through HIPE-OCRepair-Bench, a unified multilingual benchmark comprising curated datasets, a standardised evaluation protocol, baseline systems, and an open leaderboard.


Task

Participants correct noisy OCR transcripts of historical documents without access to the original images. For each text chunk (typically a paragraph or article), the dataset provides:

- one OCR hypothesis

- document metadata (language, date, publication title)

- OCR quality indicators (CER, WER, lexicon-based quality score)

Systems must output a corrected version of each OCR hypothesis. Generative (LLM-based), discriminative, and hybrid approaches are all welcome.
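For readers unfamiliar with the quality indicators listed above, the following is a minimal Python sketch of how character error rate (CER) and word error rate (WER) are commonly computed from Levenshtein edit distance. It is an illustration only, not the official HIPE-OCRepair scorer: the function names, the whitespace tokenisation, and the absence of any text normalisation are assumptions, and the sample strings are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate over whitespace tokens (an assumed tokenisation)."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)


# Invented example: ground truth, noisy OCR hypothesis, and a perfect correction.
ground_truth = "the quick brown fox"
ocr_hypothesis = "tbe qu1ck brown fox"
corrected = "the quick brown fox"

print(f"OCR       CER={cer(ground_truth, ocr_hypothesis):.3f} "
      f"WER={wer(ground_truth, ocr_hypothesis):.3f}")
print(f"Corrected CER={cer(ground_truth, corrected):.3f} "
      f"WER={wer(ground_truth, corrected):.3f}")
```

In this toy case the OCR hypothesis has two character substitutions in a 19-character reference (CER of 2/19) and two wrong words out of four (WER of 0.5); a perfect correction brings both to zero. For actual submissions, use the scorer linked above, since its normalisation and aggregation may differ from this sketch.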


Data


The benchmark consists of parallel OCR and ground-truth data drawn from multiple curated historical collections. It covers English, French, and German materials, primarily from the 17th to the 20th century, including newspapers and printed works, and consolidates existing resources alongside newly curated materials.


Important dates

- 10 Dec 2025: Sample data release

- 02 Mar 2026: Training and development data release; scorer release

- 23 Mar 2026: Hugging Face leaderboard release

- 06-08 Apr 2026: Evaluation phase (test release & submission)

- 10 Apr 2026: Results publication

- 31 Aug-4 Sep 2026: Presentation at ICDAR 2026


HIPE-OCRepair addresses a central challenge for the document analysis, NLP, and digital humanities communities: improving the usability of large historical text collections at scale. It offers a reproducible evaluation framework, openly available data and tools, and a persistent leaderboard for ongoing benchmarking beyond the competition itself.

We look forward to your participation!


Best regards,

HIPE-OCRepair 2026 Organizers

https://hipe-eval.github.io/HIPE-OCRepair-2026/
