PAWS-X
Text
NLP
License: Custom

Overview

This dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki. Note: for multilingual experiments, please use the dev_2k.tsv provided in the PAWS-X repo as the development set for every language, including English.

Data Format

All files are in tab-separated (TSV) format with four columns:

Column Name    Data
id             An ID that matches the ID of the source pair in PAWS-Wiki
sentence1      The first sentence
sentence2      The second sentence
label          Label for each pair

The source text of each translation can be retrieved by looking up the ID in the corresponding file in PAWS-Wiki.
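
As a rough illustration of the lookup described above, the following Python sketch parses two small in-memory TSV samples and joins them on the id column. It makes several assumptions that are not stated in this description: that each file starts with a header row naming the four columns, that fields are never quoted, and that the label is kept as plain text. The sample sentences and the helper names read_pairs and index_by_id are hypothetical and exist only for this example.

```python
import csv
import io
from typing import Dict, Iterable, List, NamedTuple


class Pair(NamedTuple):
    id: str
    sentence1: str
    sentence2: str
    label: str  # kept as text; the label encoding is not specified here


def read_pairs(lines: Iterable[str]) -> List[Pair]:
    """Parse tab-separated lines; the first row is ASSUMED to be a header."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        Pair(row["id"], row["sentence1"], row["sentence2"], row["label"])
        for row in reader
    ]


def index_by_id(pairs: Iterable[Pair]) -> Dict[str, Pair]:
    """Build an id -> pair mapping for source-text lookup."""
    return {pair.id: pair for pair in pairs}


# Hypothetical sample rows, NOT taken from the dataset.
SAMPLE_TRANSLATED = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tBonjour le monde\tLe monde bonjour\t0\n"
)
SAMPLE_SOURCE = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tHello world\tWorld hello\t0\n"
)

translated = read_pairs(io.StringIO(SAMPLE_TRANSLATED))
source_by_id = index_by_id(read_pairs(io.StringIO(SAMPLE_SOURCE)))

# Recover the source text of each translated pair through its ID.
for pair in translated:
    source = source_by_id[pair.id]
    print(pair.id, pair.sentence1, "<-", source.sentence1)
```

For real files, the same read_pairs function could be given a file object opened with encoding="utf-8" and newline="" in place of the StringIO sample.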

The numbers of examples for each of the six languages are shown below:

Language    Train      Dev       Test
fr          49,401     1,992     1,985
es          49,401     1,962     1,999
de          49,401     1,932     1,967
zh          49,401     1,984     1,975
ja          49,401     1,980     1,946
ko          49,401     1,965     1,972
Total       296,406    11,815    11,844
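
The totals can be cross-checked against the figures in the Overview with a few lines of Python. This is only a consistency sketch: the counts are copied from the table above, and the names COUNTS and column_totals are introduced here for illustration. It treats the Dev and Test columns together as the human-translated evaluation pairs, which matches the 23,659 figure quoted in the Overview.

```python
from typing import Dict, Tuple

# Per-language counts copied from the table above: (train, dev, test).
COUNTS: Dict[str, Tuple[int, int, int]] = {
    "fr": (49401, 1992, 1985),
    "es": (49401, 1962, 1999),
    "de": (49401, 1932, 1967),
    "zh": (49401, 1984, 1975),
    "ja": (49401, 1980, 1946),
    "ko": (49401, 1965, 1972),
}


def column_totals(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[int, int, int]:
    """Sum the train, dev, and test columns across all languages."""
    train = sum(row[0] for row in counts.values())
    dev = sum(row[1] for row in counts.values())
    test = sum(row[2] for row in counts.values())
    return train, dev, test


train_total, dev_total, test_total = column_totals(COUNTS)
evaluation_total = dev_total + test_total

print(train_total, dev_total, test_total, evaluation_total)
# prints: 296406 11815 11844 23659
```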

Citation

If you use or discuss this dataset in your work, please cite our paper:

@InProceedings{pawsx2019emnlp,
  title = {{PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification}},
  author = {Yang, Yinfei and Zhang, Yuan and Tar, Chris and Baldridge, Jason},
  booktitle = {Proc. of EMNLP},
  year = {2019}
}

License

Custom

Data Summary

Data format: Text
Data volume: 23.659K
File size: 28.88MB
Publisher: Google Research
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field. Our researchers publish regularly in academic journals, release projects as open source, and apply research to Google products.