Overview
The dataset consists of more than 400,000 lines, each a potential duplicate question pair. Each
line contains an ID for each question in the pair, the full text of each question, and a binary
value that indicates whether the pair is truly a duplicate.
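
As a rough illustration of the line layout described above, the following sketch parses such lines into typed records. The column names (qid1, qid2, question1, question2, is_duplicate), the tab delimiter, and the two sample rows are assumptions made for this example and are not taken from the dataset itself; adjust them to match the file you actually have.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, TextIO


@dataclass(frozen=True)
class QuestionPair:
    """One line of the dataset: two question IDs, two texts, one label."""
    qid1: int
    qid2: int
    question1: str
    question2: str
    is_duplicate: bool


def read_pairs(handle: TextIO, delimiter: str = "\t") -> Iterator[QuestionPair]:
    """Yield one QuestionPair per data line.

    Assumes a header row with the (hypothetical) column names used below.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    for row in reader:
        yield QuestionPair(
            qid1=int(row["qid1"]),
            qid2=int(row["qid2"]),
            question1=row["question1"],
            question2=row["question2"],
            # The label is assumed to be stored as "1" (duplicate) or "0".
            is_duplicate=row["is_duplicate"].strip() == "1",
        )


# Made-up rows for illustration only; they are not real dataset entries.
SAMPLE = (
    "qid1\tqid2\tquestion1\tquestion2\tis_duplicate\n"
    "1\t2\tHow do I learn Python?\tWhat is the best way to learn Python?\t1\n"
    "3\t4\tHow do I learn Python?\tHow do I learn to cook?\t0\n"
)

pairs = list(read_pairs(io.StringIO(SAMPLE)))
```

To read from disk instead, pass an open file handle (opened with newline="" as the csv module recommends) in place of the io.StringIO object.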
The distribution of questions in the dataset should not be taken to be representative of the
distribution of questions asked on Quora. This is partly because of the combination of sampling
procedures used and partly because of sanitization measures applied to the final dataset (e.g.,
removal of questions with extremely long question details).
The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect.
Data Collection
Our original sampling method returned an imbalanced dataset with many more true duplicate pairs than non-duplicates. Therefore, we supplemented the dataset with negative examples. One source of negative examples was pairs of “related questions”, which, although they pertain to similar topics, are not truly semantically equivalent.
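
One simple way to check how balanced a set of labels is, before or after adding negative examples, is to compute the share of each label value. The sketch below is only an illustration: the function name label_balance and the toy label list are hypothetical, and it assumes labels are encoded as 1 for duplicate and 0 for non-duplicate.

```python
from collections import Counter
from typing import Dict, Iterable


def label_balance(labels: Iterable[int]) -> Dict[int, float]:
    """Return the fraction of examples carrying each label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        # No examples: there is no meaningful balance to report.
        return {}
    return {label: count / total for label, count in counts.items()}


# Toy labels for illustration only (1 = duplicate, 0 = non-duplicate).
toy_labels = [1, 1, 1, 0]
balance = label_balance(toy_labels)
```

A share far from 0.5 for either label suggests the kind of imbalance that supplementing with negative examples is meant to reduce.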