MAGICDATA Mandarin Chinese Read Speech Corpus
License:
CC BY-NC-ND 4.0
Overview
The corpus is a subset of a much larger dataset (a 10,566.9-hour Mandarin Chinese speech corpus) that was recorded in the same environment. It is intended to support researchers in speech recognition, machine translation, speaker recognition, and other speech-related fields, and is therefore free for academic use.
Data Format
The main characteristics of the corpus are:
- The corpus contains 755 hours of speech data, most of it recorded on mobile devices.
- 1080 speakers from different accent areas in China took part in the recording.
- The sentence transcription accuracy is higher than 98%.
- Recordings were made in a quiet indoor environment.
- The database is divided into training, validation, and testing sets in a ratio of 51:1:2.
- Detailed information, such as speech data coding and speaker information, is preserved in the metadata file.
- The recording texts cover diverse domains, including interactive Q&A, music search, SNS messages, home command and control, etc.
- Segmented transcripts are also provided.
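
For a rough sense of scale, the 51:1:2 split can be turned into approximate hours per set. This is only a sketch: it assumes the ratio applies to the 755 hours of audio (the description does not say whether the ratio is measured in hours, utterances, or speakers), and the names `TOTAL_HOURS`, `RATIO`, and `split_hours` are illustrative, not part of the corpus.

```python
# Assumption: the 51:1:2 ratio is applied to total duration in hours.
# The corpus description does not state the unit, so treat these as estimates.
TOTAL_HOURS = 755
RATIO = {"training": 51, "validation": 1, "testing": 2}


def split_hours(total_hours, ratio):
    """Distribute total_hours across the sets in proportion to ratio."""
    parts = sum(ratio.values())  # 51 + 1 + 2 = 54
    return {name: total_hours * share / parts for name, share in ratio.items()}


hours = split_hours(TOTAL_HOURS, RATIO)
for name, value in hours.items():
    print(f"{name}: about {value:.1f} hours")
```

Under that assumption, the training set accounts for roughly 713 hours, with about 14 hours for validation and 28 hours for testing.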