icwb2
Text
NLP
License: Custom

Overview

The Second International Chinese Word Segmentation Bakeoff took place over the summer of 2005, and the results were presented at the 4th SIGHAN Workshop, held at IJCNLP'05 on October 14-15, 2005.

Corpora from the following organizations were used:

  • CKIP, Academia Sinica, Taiwan
  • City University of Hong Kong, Hong Kong SAR
  • Peking University, China
  • Microsoft Research, China

Data Collection

Four corpora are available for this bakeoff:

Corpus                          Encoding                 Word Types  Words      Character Types  Characters

Traditional Chinese
  Academia Sinica               Unicode/Big Five Plus    141,340     5,449,698  6,117            8,368,050
  City University of Hong Kong  HKSCS Unicode/Big Five   69,085      1,455,629  4,923            2,403,355

Simplified Chinese
  Peking University             CP936/Unicode            55,303      1,109,947  4,698            1,826,448
  Microsoft Research            CP936/Unicode            88,119      2,368,391  5,167            4,050,469
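
The four count columns above can be read as token and type counts over words and characters. The following is a minimal sketch of how such figures could be computed, assuming the corpus text is already segmented with whitespace between words (an assumption, not something this page states). The function name `corpus_stats` and the sample sentences are hypothetical, and the official figures may have been produced with different rules, for example in how punctuation is counted.

```python
from collections import Counter


def corpus_stats(lines):
    """Count word and character tokens/types in whitespace-segmented text.

    Assumption: each line holds words separated by whitespace.
    """
    words = Counter()
    chars = Counter()
    for line in lines:
        for word in line.split():
            words[word] += 1      # one word token
            chars.update(word)    # one token per character in the word
    return {
        "word_types": len(words),           # distinct words
        "words": sum(words.values()),       # total word tokens
        "character_types": len(chars),      # distinct characters
        "characters": sum(chars.values()),  # total character tokens
    }


# Hypothetical two-line sample, not taken from the corpora.
sample = ["我 喜欢 北京", "北京 很 大"]
stats = corpus_stats(sample)
print(stats)
```

To run this on a real file, the lines would need to be decoded with the encoding listed for that corpus, for instance by opening a CP936 file with `open(path, encoding="cp936")`, where `path` stands in for whatever file name the download actually uses.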

License

Custom

Data Summary

Data format: Text
Data volume: --
File size: 50.2MB
Publisher: The University of Chicago
The University of Chicago is a private research university in Chicago, Illinois.