Title-Based Semantic Subject Indexing
2D Classification
Text Detection
License: CC-BY-SA 4.0

Overview

Semantic Subject Indexing

Semantic subject indexing is the process of annotating documents with terms that describe what each document is about. It is often used in digital libraries to make documents easier to find. Annotations are usually created by human domain experts, who select appropriate terms from a pre-specified set of available labels. To keep up with the vast number of new publications, (semi-)automatic tools are being developed that assist the experts by suggesting terms for annotation.
Unfortunately, due to legal restrictions, these tools often can use neither the full text nor the abstract of a publication. It is therefore desirable to explore techniques that work with the publication's metadata only. To some extent, it is already possible to achieve performance competitive with full-text methods by using titles alone.
Yet the performance of automatic subject indexing methods is still far from the level of human annotators. Semantic subject indexing can be framed as a multi-label classification problem in which entry (i,j) of an indicator matrix is set to one if label j has been assigned to document i, and to zero otherwise. Major challenges are that the label space is usually very large (up to almost 30,000 labels), that label frequencies follow a power law, and that the labels are subject to concept drift (cf. Toepfer and Seifert).
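The indicator-matrix framing above can be illustrated with a minimal sketch. The toy documents, the label names, and the helper build_indicator_matrix are made up for illustration and are not taken from the datasets; with a label space of almost 30,000 labels, a sparse representation would likely be needed instead of the dense nested lists used here.

```python
# Toy data (hypothetical): each document i carries a set of labels.
doc_labels = [
    {"Inflation", "Monetary policy"},
    {"Labour market"},
    {"Inflation", "Labour market"},
]


def build_indicator_matrix(doc_labels):
    """Return (matrix, vocabulary) where matrix[i][j] == 1
    if and only if label j has been assigned to document i."""
    # Fix a column order by sorting the set of all labels that occur.
    vocabulary = sorted(set().union(*doc_labels))
    column = {label: j for j, label in enumerate(vocabulary)}

    # Start from an all-zero matrix with one row per document.
    matrix = [[0] * len(vocabulary) for _ in doc_labels]
    for i, labels in enumerate(doc_labels):
        for label in labels:
            matrix[i][column[label]] = 1
    return matrix, vocabulary


matrix, vocabulary = build_indicator_matrix(doc_labels)
```

Each row of the matrix is the target vector a multi-label classifier would be asked to predict from the corresponding title.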

Here, we provide two large-scale datasets from the domains of economics and business studies (EconBiz) and biomedicine (PubMed) that were used in our recent study. Each record comes with the title and the respective annotated labels. Can you find insights in the data that help to understand the problem of semantic subject indexing better? Can you come up with clever ideas that push the state of the art in automatic semantic subject indexing? We are excited to see what the collective power of data scientists can achieve on this task!

Content

We compiled two English datasets from two digital libraries, EconBiz and PubMed.

EconBiz

The EconBiz dataset was compiled from a July 2017 metadata export provided by ZBW - Leibniz Information Centre for Economics. We retained only those publications that were flagged as being in English and that were annotated with STW labels. Afterwards, we removed duplicates by checking for identical titles and labels. In total, approximately 1,064,000 publications remain.
The annotations were selected by human annotators from the Standard Thesaurus Wirtschaft (STW), which contains approximately 6,000 labels.
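The duplicate removal described above can be sketched as follows. This is only one plausible reading: the source does not say whether titles were normalized or whether label order mattered, so the sketch assumes exact title matches and ignores label order. The record layout and the function name deduplicate are hypothetical.

```python
def deduplicate(records):
    """Keep the first record for each (title, label set) pair.

    Assumption: two records are duplicates when their titles are
    exactly equal and they carry the same labels in any order.
    """
    seen = set()
    unique = []
    for record in records:
        key = (record["title"], frozenset(record["labels"]))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Made-up records for illustration only.
records = [
    {"id": "a1", "title": "Title A", "labels": ["L1", "L2"]},
    {"id": "a2", "title": "Title A", "labels": ["L2", "L1"]},  # duplicate of a1
    {"id": "a3", "title": "Title A", "labels": ["L1"]},        # same title, other labels
]
unique_records = deduplicate(records)
```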

PubMed

The PubMed dataset was compiled from the training set of the 5th BioASQ challenge on large-scale semantic subject indexing of biomedical articles, all of which are in English. Again, we removed duplicates by checking for identical titles and labels. In total, approximately 12.8 million publications remain.
The labels are so-called MeSH (Medical Subject Headings) terms. Approximately 28,000 of them are used in our data.

Fields
Both datasets share the same set of fields:

  • id: An identifier used to refer to the publication in the respective digital library.
  • title: The title of the publication.
  • labels: A string that represents a list of labels, separated by TAB characters.
  • fold: The number of the fold a sample belongs to in our study, provided so that our results can be reproduced. Folds 0 to 9 contain the samples that have a full text; fold 10 contains all other samples.
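A minimal sketch of reading these fields and holding out one fold is shown below. It assumes a comma-separated file with a header row named after the four fields; the actual file layout is not stated here, so adjust the reader accordingly. The sample rows, the placeholder labels L1 to L3, and the helpers read_records and split_by_fold are made up for illustration.

```python
import csv
import io

# Made-up sample in the ASSUMED layout (comma-separated, header row).
# Labels inside the "labels" column are separated by TAB characters.
SAMPLE = (
    "id,title,labels,fold\n"
    "a1,Inflation targeting in small open economies,L1\tL2,0\n"
    "a2,Wage dynamics after labour market reforms,L3,10\n"
    "a3,Central bank credibility and expectations,L1,1\n"
)


def read_records(handle):
    """Parse rows into dicts, splitting the TAB-separated label string."""
    records = []
    for row in csv.DictReader(handle):
        records.append({
            "id": row["id"],
            "title": row["title"],
            "labels": row["labels"].split("\t") if row["labels"] else [],
            "fold": int(row["fold"]),
        })
    return records


def split_by_fold(records, test_fold):
    """Hold out one fold for testing and train on every other sample."""
    train = [r for r in records if r["fold"] != test_fold]
    test = [r for r in records if r["fold"] == test_fold]
    return train, test


records = read_records(io.StringIO(SAMPLE))
train, test = split_by_fold(records, test_fold=0)
```

Holding out one of folds 0 to 9 while keeping fold 10 in the training portion is one way to use the fold field, not necessarily the protocol of the original study.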

Acknowledgements

We would like to thank ZBW - Leibniz Information Centre for Economics for providing the EconBiz dataset, and in particular Tamara Pianos and Tobias Rebholz.

We would also like to thank the team behind the BioASQ challenge, from whose data we compiled the PubMed dataset. This organization is dedicated to advancing the state of the art in large-scale semantic indexing. It is currently running the 6th iteration of its challenge, which you should definitely check out!

The PubMed dataset was gathered by BioASQ in accordance with the terms of the U.S. National Library of Medicine regarding public use and redistribution of the data.

Data summary
Data format: text, image
Data volume: 2
File size: 158.1MB
Publisher: Florian Mai