Wikipedia Dump 20200820
Text Detection
License: CC-BY-SA 4.0

Overview

Context

This is the full August 20, 2020 dump of English Wikipedia. I obtained the full dataset to practice using word-embedding algorithms like word2vec. I was unable to find any Wikipedia dumps on Kaggle, which struck me as an odd omission, so I thought I should just upload it, data quotas be damned. I did the extraction using gensim's WikiCorpus API.
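
In case it helps others reproduce the preprocessing, here is a minimal sketch of that extraction step. It assumes gensim is installed and that the input is the standard pages-articles archive; the filenames below are illustrative, and token handling differs slightly across gensim versions.

# Stream articles out of the raw Wikipedia XML dump with gensim's WikiCorpus.
# 'enwiki-20200820-pages-articles.xml.bz2' is an assumed filename; use your download.
from gensim.corpora.wikicorpus import WikiCorpus

dump_path = 'enwiki-20200820-pages-articles.xml.bz2'
wiki = WikiCorpus(dump_path, dictionary={})  # empty dict: skip building a vocabulary

with open('wiki.en.txt', 'w', encoding='utf-8') as out:
    # get_texts() yields each article as a list of lower-cased tokens
    # with punctuation already stripped.
    for tokens in wiki.get_texts():
        out.write(' '.join(tokens) + '\n')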

Content

The data comes in a giant text file where each line is a Wikipedia article. Punctuation has been stripped and all words are lower-cased.
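
Because each line is already a whitespace-separated, lower-cased article, the file can be streamed straight into an embedding trainer. A minimal sketch of the word2vec practice run mentioned above, assuming the file is stored locally as wiki.en.txt (the path and hyperparameters are illustrative; parameter names follow gensim 4.x):

# Train word2vec on the one-article-per-line text file.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence('wiki.en.txt')  # streams one pre-tokenized article per line
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
model.save('wiki_word2vec.model')

# Sanity check: nearest neighbours of a common word.
print(model.wv.most_similar('physics', topn=5))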

Acknowledgements

What clued me in to using Wikipedia's data was this tutorial, along with the article that introduced FastText.

Data Summary

Data format: text
Data volume: 1 file
File size: 694.98 MB
Publisher: movinglinguini
