许可协议: CC BY-SA 4.0
Traditional Chinese Universal Dependencies Treebank annotated and converted by Google.
Tokenization and Word Segmentation
- This corpus contains 4997 sentences and 123291 tokens.
- This corpus contains 122962 tokens (100%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 41 types of words that contain both letters and punctuation. Examples: #A, DC-10, km/h, #B, #C, #D, #E, #F, #G, -an, A-AVG, AK-47, Arzacq-Arraziguet, Beaune-Sud, Berne-Belp, CI-7957, CRH380B-002, F-15A, F-16A, Frito-Lay, It's, Kink.com, MD-11, Micro-USM, NX-01, Navy's, O., P-700, Pre-rendering, S-IVB, TVS-5, Tu-16, Uhler-Phillips, al-Banna, f(x), g(x), t.163.com, t.qq.com, t.sina.com.cn, t.sohu.com, t.xxxx.com
Click Here to learn more.
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 300 contributors producing more than 150 treebanks in 90 languages.