ICDAR2019 Post-OCR Text Correction
License:
Custom
Overview
This original corpus consists of OCRed documents in 10 European languages, totalling about 20M characters
(3.5M tokens), aligned with their corresponding Gold Standard (Ground-Truth). Each language contains
one or several sub-folders (unbalanced), according to the collected dataset sources, as follows:
Dataset details: partitioning
The original Excel form: click here.
Each training file contains three blocks according to the following structure. Note that only the first
block [OCR_output] will be included in the test set.
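A file with this layout can be split into its blocks with a few lines of Python. The sketch below is illustrative only and rests on one assumption: each block begins on its own line with a bracketed tag (such as [OCR_output]) followed by the block text. Only [OCR_output] is named in this description, so the other tags are read from the file rather than hard-coded. The function name parse_blocks and the placeholder tags BLOCK_2 and BLOCK_3 are hypothetical and are not part of the dataset.

```python
import re

# Assumption: a block starts on a line of the form "[TAG] text".
# Optional spaces inside the brackets are tolerated.
TAG_LINE = re.compile(r"^\[\s*([^\]]+?)\s*\]\s?(.*)$")


def parse_blocks(text):
    """Return a dict mapping each bracketed tag to its block text."""
    blocks = {}
    current = None
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            # A new tagged line opens a new block.
            current = match.group(1)
            blocks[current] = match.group(2)
        elif current is not None:
            # Untagged lines are treated as a continuation of the open block.
            blocks[current] += "\n" + line
    return blocks


# Hypothetical sample: BLOCK_2 and BLOCK_3 are placeholders, not real tags.
sample = (
    "[OCR_output] Tlie quick brovvn fox\n"
    "[BLOCK_2] Tlie quick brovvn fox\n"
    "[BLOCK_3] The quick brown fox\n"
)
blocks = parse_blocks(sample)
ocr_text = blocks["OCR_output"]  # the only block expected in the test set
```

A test-set file, which holds only the first block, would yield a dict with the single key OCR_output under the same assumption. OCR text that itself contains a line starting with a bracketed token would be misread as a new block, so the pattern may need tightening against the actual files.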
Citation
@inproceedings{rigaud2019pocr,
title="ICDAR 2019 Competition on Post-OCR Text Correction",
author={Rigaud, Christophe and Doucet, Antoine and Coustaty, Mickael and Moreux, Jean-Philippe},
year={2019},
booktitle={Proceedings of the 15th International Conference on Document Analysis and Recognition (2019)}
}