PubTabNet
Text
OCR/Text Detection
License: CDLA-Permissive 1.0

Overview

PubTabNet contains heterogeneous tables in both image and HTML format. It can be used to train and evaluate image-based table recognition models. A model must recognize both the structure and the content of each table and reconstruct its HTML representation from the table image alone. The HTML representation encodes the structure of the table and the content of each table cell. The position (bounding box) of each non-empty table cell is also provided to support more diverse model designs. The tables come from the PubMed Central Open Access Subset (commercial use collection) and are extracted automatically, in both image and HTML format, by matching the PDF and XML versions of the articles.

Data Collection

The table images are extracted from the scientific publications included in the PubMed Central Open Access Subset (commercial use collection). Table regions are identified by matching the PDF format and the XML format of the articles in the PubMed Central Open Access Subset.

Data Annotation

The annotation is in the jsonl (jsonlines) format, where each line contains the annotations for one sample. Each line is structured as follows:

{
   'filename': str,
   'split': str,
   'imgid': int,
   'html': {
     'structure': {'tokens': [str]},
     'cell': [
       {
         'tokens': [str],
         'bbox': [x0, y0, x1, y1]  # only non-empty cells have this attribute
       }
     ]
   }
}
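
The following is a minimal sketch, not the dataset authors' tooling, of how a record in this format might be read and turned back into an HTML table. The sample record is hand-made for illustration and is not taken from the dataset. The helper names (iter_records, get_cells, cell_boxes, rebuild_html) are hypothetical. The sketch assumes that cell content belongs directly after each '<td>' token (or after the '>' token that closes a '<td' carrying attributes such as colspan), that the cell list is in the same order as those positions, and that cell tokens are already valid HTML fragments. The key 'cell' follows the schema above; the fallback to 'cells' is an assumption in case a release names the key differently.

```python
import json

# Hand-made sample line following the schema above (illustrative only).
SAMPLE_LINE = json.dumps({
    "filename": "example.png",
    "split": "train",
    "imgid": 0,
    "html": {
        "structure": {"tokens": [
            "<thead>", "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>", "</thead>",
            "<tbody>", "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>", "</tbody>",
        ]},
        "cell": [
            {"tokens": ["A"], "bbox": [0, 0, 10, 10]},
            {"tokens": ["B"], "bbox": [12, 0, 22, 10]},
            {"tokens": ["1"], "bbox": [0, 12, 10, 22]},
            {"tokens": []},  # empty cell: no 'bbox' attribute
        ],
    },
})


def iter_records(lines):
    """Yield one annotation dict per non-blank jsonl line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def get_cells(record):
    """Return the cell list; 'cells' is accepted as an assumed alternative key."""
    html = record["html"]
    return html.get("cell", html.get("cells", []))


def cell_boxes(record):
    """Collect [x0, y0, x1, y1] boxes; only non-empty cells carry 'bbox'."""
    return [cell["bbox"] for cell in get_cells(record) if "bbox" in cell]


def rebuild_html(record):
    """Insert cell content into the structure tokens and wrap in <table>."""
    tokens = list(record["html"]["structure"]["tokens"])
    # Assumption: content goes right after '<td>' or after the '>' closing '<td ...'.
    slots = [i for i, tok in enumerate(tokens) if tok in ("<td>", ">")]
    cells = get_cells(record)
    if len(slots) != len(cells):
        raise ValueError("number of cells does not match number of <td> slots")
    # Insert from the end so earlier slot indices stay valid.
    for i, cell in zip(reversed(slots), reversed(cells)):
        if cell["tokens"]:
            tokens.insert(i + 1, "".join(cell["tokens"]))
    return "<table>" + "".join(tokens) + "</table>"


record = next(iter_records([SAMPLE_LINE]))
print(rebuild_html(record))
print(cell_boxes(record))
```

With a real annotation file, the same functions could be applied by passing an open file handle to iter_records, since iterating a text file yields one line at a time.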

Citation

@article{zhong2019image,
  title={Image-based table recognition: data, model, and evaluation},
  author={Zhong, Xu and ShafieiBavani, Elaheh and Yepes, Antonio Jimeno},
  journal={arXiv preprint arXiv:1911.10683},
  year={2019}
}

License

CDLA-Permissive 1.0

Data Summary
Data format
Image
Data volume
--
File size
10.46GB
Publisher
PubMed Central
PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM).