Wikipedia Generation Dataset
License: Unknown

Overview

This directory contains the code and scripts to generate the dataset from the paper "Generating Wikipedia by Summarizing Long Sequences". The task is to generate a Wikipedia article from two inputs: the contents of the references cited in that article, and the top 10 Google search results for the article's title.
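To make the task's inputs and output concrete, here is a minimal sketch of how a single example could be represented. The class name `WikiExample` and all of its field names are hypothetical and chosen for illustration only; they are not taken from the dataset's actual code or file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WikiExample:
    """One illustrative example; names here are assumptions, not the real schema."""
    title: str                                                 # Wikipedia article title
    reference_texts: List[str] = field(default_factory=list)   # text of cited references
    search_texts: List[str] = field(default_factory=list)      # text of search results, ranked
    target_article: str = ""                                   # article text to be generated

    def source_documents(self, max_search_results: int = 10) -> List[str]:
        """Return all reference texts plus at most the top-N search results."""
        return self.reference_texts + self.search_texts[:max_search_results]
```

A model for this task would read the output of `source_documents()` and be trained to produce `target_article`.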

Two sources are used for the reference URLs:

  1. CommonCrawl, an open repository of web crawl data. Its advantage is that the resulting dataset is fully reproducible. However, its coverage of the reference URLs is limited.
  2. Live web fetches. Coverage is considerably higher, but the content is subject to change.
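The coverage trade-off between the two sources can be measured as the fraction of an article's reference URLs for which any content was retrieved. The sketch below is illustrative only: the function `reference_coverage` and the `fetched` mapping (URL to retrieved text, or `None` when nothing was found) are assumptions, not part of the dataset's scripts.

```python
from typing import Dict, Iterable, Optional


def reference_coverage(reference_urls: Iterable[str],
                       fetched: Dict[str, Optional[str]]) -> float:
    """Fraction of distinct reference URLs with non-empty retrieved content."""
    urls = set(reference_urls)
    if not urls:
        return 0.0
    # A URL counts as covered only if it maps to a non-empty string.
    hits = sum(1 for url in urls if fetched.get(url))
    return hits / len(urls)
```

Computing this value once per source, over the same set of reference URLs, gives a direct comparison of how much material each source recovers.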

This document provides instructions for producing both datasets.

Citation

Please use the following citation when referencing the dataset:

@article{liu2018generating,
  title={Generating wikipedia by summarizing long sequences},
  author={Liu, Peter J and Saleh, Mohammad and Pot, Etienne and Goodrich, Ben and Sepassi, Ryan and Kaiser, Lukasz and Shazeer, Noam},
  journal={arXiv preprint arXiv:1801.10198},
  year={2018}
}