Within HisDoc 2.0 we have developed DIVA-HisDB, a precisely annotated large dataset of challenging medieval manuscripts for the evaluation of several Document Image Analysis (DIA) tasks such as layout analysis, text line segmentation, binarization, and writer identification. The database consists of 150 annotated pages from three different medieval manuscripts with challenging layouts.
DIVA-HisDB is a collection of three medieval manuscripts that were chosen, together with partners from e-codices and the Humanities faculty of the University of Fribourg, for the complexity of their layout:
- St. Gallen, Stiftsbibliothek, Cod. Sang. 18, codicological unit 4 (CSG18),
- St. Gallen, Stiftsbibliothek, Cod. Sang. 863 (CSG863),
- Cologny-Geneve, Fondation Martin Bodmer, Cod. Bodmer 55 (CB55).
The corresponding ground truth (GT) visualizations show the three annotation categories (main text body, comments, decorations).
DIVA-HisDB consists of 150 pages in total, 50 pages from each manuscript. For the dataset as a whole, as well as for its division into training, validation, and test sets, we have selected a representative set of pages.
Creation of DIVA-HisDB
To annotate the three selected medieval manuscripts, we use GraphManuscribble, a semi-automatic tool based on document graphs and pen-based scribbling interaction. DIVA-HisDB provides the GT in the PAGE format.
We distinguish between the main text body, the comments (marginal and interlinear glosses, explanations, corrections), and decorations (every character/sign that exceeds the size of a text line and/or is written in red). Paleographic features like display scripts or black majuscules are not marked as decorations, since they are too difficult for the automatic layout analysis tool to distinguish in this first step. We entirely disregard elements that do not belong to a text line or are not characters/signs, such as separation lines between two marginal glosses.
Within the comments category, we do not distinguish between the original glosses, added by the writer of the main text, and the foliation added in the modern period. Also, if corrections have been made by overwriting the main text, they are not marked as posterior to the original main text; but if the corrections are made above and below the line, e.g., with a sign below the line to mark the affected letters and the correct letter(s) above the line, they are annotated as comments.
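As an illustration of working with GT stored in the PAGE format, the following minimal sketch reads regions and their outline polygons from a PAGE-style XML document using only the Python standard library. The XML snippet is hand-written for this example and is not taken from DIVA-HisDB; in particular, using the `type` attribute to tell categories apart is an assumption, and the actual files may encode the three categories differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PAGE-style snippet (NOT an actual DIVA-HisDB file).
SAMPLE_PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="80">
    <TextRegion id="r1" type="paragraph">
      <Coords points="10,10 60,10 60,40 10,40"/>
    </TextRegion>
    <TextRegion id="r2" type="marginalia">
      <Coords points="70,10 90,10 90,20 70,20"/>
    </TextRegion>
  </Page>
</PcGts>
""".strip()


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_regions(xml_text):
    """Return one dict per TextRegion with its id, type, and outline polygon."""
    root = ET.fromstring(xml_text)
    regions = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextRegion":
            continue
        # Only look at the region's direct Coords child, not those of nested lines.
        coords = next((c for c in elem if local_name(c.tag) == "Coords"), None)
        polygon = parse_points(coords.get("points", "")) if coords is not None else []
        regions.append({"id": elem.get("id"), "type": elem.get("type"), "polygon": polygon})
    return regions


for region in read_regions(SAMPLE_PAGE_XML):
    print(region["id"], region["type"], len(region["polygon"]), "points")
```

Matching on local tag names rather than a fixed namespace keeps the sketch independent of the particular PAGE schema version a file declares.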
Distribution of the annotation categories
For the whole of DIVA-HisDB, 63.97% of the annotated regions are comments, 9.36% are decorations, and 26.67% belong to the main text body. Note that the proportions of the categories differ between the manuscripts. While the number of text lines and comments is almost the same in CSG863, significantly fewer comments are present in CSG18. Additionally, when the surface area (number of pixels) of each annotated category is measured, 41.37% of the total surface area consists of comments, 1.69% of decorations, and 56.94% of the main text body.
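The two kinds of percentages above (share of regions and share of surface area) can be computed as in the following minimal sketch. The function name `category_shares`, the category labels, and the toy region list are hypothetical and chosen only for illustration; they are not DIVA-HisDB data and do not reproduce the figures reported above.

```python
from collections import Counter


def category_shares(regions):
    """Return (share_by_count, share_by_area) in percent per category.

    `regions` is an iterable of (category, pixel_area) pairs.
    """
    counts = Counter()
    areas = Counter()
    for category, pixel_area in regions:
        counts[category] += 1          # one more region in this category
        areas[category] += pixel_area  # accumulate its surface area in pixels
    total_count = sum(counts.values())
    total_area = sum(areas.values())
    by_count = {c: 100.0 * n / total_count for c, n in counts.items()}
    by_area = {c: 100.0 * a / total_area for c, a in areas.items()}
    return by_count, by_area


# Hypothetical toy regions: (category, area in pixels).
toy_regions = [
    ("main_text", 600),
    ("comment", 100),
    ("comment", 150),
    ("comment", 50),
    ("decoration", 100),
]

by_count, by_area = category_shares(toy_regions)
for category in by_count:
    print(f"{category}: {by_count[category]:.2f}% of regions, "
          f"{by_area[category]:.2f}% of area")
```

In this toy input, comments make up most of the regions while the main text covers most of the area, which is the same qualitative pattern as in the reported statistics: many small comment regions versus fewer but larger main text regions.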