Question Pairs
Text
NLP
License: Custom

Overview

The dataset consists of over 400,000 lines of potential question duplicate pairs. Each line contains IDs for each question in the pair, the full text for each question, and a binary value that indicates whether the line truly contains a duplicate pair.
The distribution of questions in the dataset should not be taken as representative of the distribution of questions asked on Quora. This is partly because of the combination of sampling procedures used, and partly because of sanitization measures applied to the final dataset (e.g., removal of questions with extremely long question details).
The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect.
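
The line layout described above (two question IDs, two question texts, and a binary label) can be read with a sketch like the following. The column names `qid1`, `qid2`, `question1`, `question2`, and `is_duplicate`, the tab separator, and the header row are assumptions made for illustration; this page does not state them, so check them against the downloaded file. The two sample rows are made up and are not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class QuestionPair:
    qid1: int
    qid2: int
    question1: str
    question2: str
    is_duplicate: bool


def read_pairs(handle):
    """Yield QuestionPair records from a tab-separated stream with a header row.

    The column names used here are assumed, not confirmed by the dataset page.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield QuestionPair(
            qid1=int(row["qid1"]),
            qid2=int(row["qid2"]),
            question1=row["question1"],
            question2=row["question2"],
            is_duplicate=row["is_duplicate"] == "1",
        )


# Two made-up rows for illustration only.
SAMPLE = (
    "qid1\tqid2\tquestion1\tquestion2\tis_duplicate\n"
    "1\t2\tHow do I learn Python?\tWhat is the best way to learn Python?\t1\n"
    "3\t4\tHow do I learn Python?\tHow do I learn to cook?\t0\n"
)

pairs = list(read_pairs(io.StringIO(SAMPLE)))
```

To read a real file, pass an open file handle (for example `open(path, encoding="utf-8", newline="")`) instead of the in-memory sample. Because the labels are noisy, treat `is_duplicate` as an approximate signal rather than a guaranteed ground truth.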

Data Collection

Our original sampling method returned an imbalanced dataset with many more true examples of duplicate pairs than non-duplicates. Therefore, we supplemented the dataset with negative examples. One source of negative examples was pairs of “related questions” which, although pertaining to similar topics, are not truly semantically equivalent.
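
Since the class balance was adjusted during collection, it can be worth checking the share of duplicate and non-duplicate pairs before training on the data. The helper below is a minimal sketch; `label_balance` is a hypothetical name, and the input list is a toy example, not a statistic from the dataset.

```python
from collections import Counter


def label_balance(labels):
    """Return the fraction of duplicate and non-duplicate labels.

    `labels` is any iterable of binary values (0/1 or False/True).
    """
    counts = Counter(bool(label) for label in labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return {
        "duplicate": counts[True] / total,
        "non_duplicate": counts[False] / total,
    }


# Toy labels for illustration only; not taken from the dataset.
balance = label_balance([1, 0, 0, 1, 0])
```

A strongly skewed result would suggest using stratified splits or class weights, though the right choice depends on the model and the evaluation metric.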

License

Custom

Data Summary

Data format: Text
Data volume: --
File size: 55.48MB
Publisher: Quora
Quora is an American question-and-answer website where Internet users ask, answer, follow, and edit questions, either factually or in the form of opinions.