Text in the wild
2D Box
Text
OCR/Text Detection
License: CC BY-NC-SA 4.0

Overview

We describe a newly created dataset of Chinese text containing about 1 million Chinese character instances, drawn from 3850 unique characters, annotated by experts in over 30000 street view images.
The dataset is challenging and diverse: it contains planar text, raised text, text under poor illumination, distant text, partially occluded text, etc.

Data Annotation

The overall information file (../data/annotations/info.json) is UTF-8 (no BOM) encoded JSON.
The data structure of this information file is described below.

information:
{
    train: [image_meta_0, image_meta_1, image_meta_2, ...],
    val: [image_meta_0, image_meta_1, image_meta_2, ...],
    test_cls: [image_meta_0, image_meta_1, image_meta_2, ...],
    test_det: [image_meta_0, image_meta_1, image_meta_2, ...],
}

image_meta:
{
    image_id: str,
    file_name: str,
    width: int,
    height: int,
}
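
As an illustration of the structure above, the following minimal sketch reads the information file and counts the image_meta entries in each split. The function names load_info and split_sizes are hypothetical helpers introduced here, not part of any official tooling, and the relative path is simply the one quoted above.

```python
import json
from pathlib import Path

# The four top-level keys described in the "information" structure.
SPLITS = ("train", "val", "test_cls", "test_det")


def load_info(path):
    """Read the overall information file (UTF-8 JSON, no BOM) into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def split_sizes(info):
    """Return the number of image_meta entries in each split."""
    # A missing key is treated as an empty split (an assumption of this sketch).
    return {name: len(info.get(name, [])) for name in SPLITS}


if __name__ == "__main__":
    # Path taken from the description above; adjust to your local layout.
    info_path = Path("../data/annotations/info.json")
    if info_path.exists():
        print(split_sizes(load_info(info_path)))
```

Each entry of a split is an image_meta dict, so fields such as meta["image_id"] and meta["file_name"] can be read directly while iterating over info["train"].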

The train, val, test_cls, and test_det keys denote the training set, the validation set, the testing set for classification, and the testing set for detection, respectively.
The resolution of each image is always 2048×2048.
The image ID is a 7-digit string whose first digit indicates the camera orientation according to the following rule.

'0': back
'1': left
'2': front
'3': right
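
The orientation rule above can be sketched as a small lookup. The helper name camera_orientation is hypothetical, and raising ValueError on malformed IDs is a choice made for this sketch rather than documented behavior.

```python
# Mapping from the first digit of image_id to the camera orientation.
ORIENTATIONS = {"0": "back", "1": "left", "2": "front", "3": "right"}


def camera_orientation(image_id: str) -> str:
    """Return the camera orientation encoded in a 7-digit image ID."""
    # Reject anything that is not exactly seven ASCII digits.
    if len(image_id) != 7 or not (image_id.isascii() and image_id.isdigit()):
        raise ValueError(f"image_id must be a 7-digit string, got {image_id!r}")
    first = image_id[0]
    if first not in ORIENTATIONS:
        raise ValueError(f"unknown orientation digit {first!r} in {image_id!r}")
    return ORIENTATIONS[first]
```

For example, an ID beginning with '2' maps to "front". The IDs must be kept as strings, since converting them to integers would drop a leading '0'.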

The file_name field doesn't contain a directory name, and is always image_id + '.jpg'.
More information about data annotation can be found here
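
The constraints stated in this section (fixed 2048×2048 resolution, file_name equal to image_id + '.jpg') lend themselves to a quick sanity check. The sketch below is one possible way to do that; check_image_meta and image_path are hypothetical helpers, and the image directory passed to image_path is an assumption, since the description does not say where the images are stored.

```python
from pathlib import Path


def check_image_meta(meta):
    """Return a list of problems found in one image_meta dict (empty if none)."""
    problems = []
    # file_name carries no directory and is always image_id + '.jpg'.
    if meta["file_name"] != meta["image_id"] + ".jpg":
        problems.append("file_name is not image_id + '.jpg'")
    # Every image is stated to be 2048x2048.
    if (meta["width"], meta["height"]) != (2048, 2048):
        problems.append("resolution is not 2048x2048")
    return problems


def image_path(image_dir, meta):
    """Join a user-supplied image directory (assumed) with the bare file name."""
    return Path(image_dir) / meta["file_name"]
```

Because file_name has no directory component, the caller has to supply the image directory explicitly.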

Citation

@article{yuan2019ctw,
  author  = {Tai{-}Ling Yuan and Zhe Zhu and Kun Xu and Cheng{-}Jun Li and Tai{-}Jiang Mu and Shi{-}Min Hu},
  title   = {A Large Chinese Text Dataset in the Wild},
  journal = {Journal of Computer Science and Technology},
  volume  = {34},
  number  = {3},
  pages   = {509--521},
  year    = {2019},
}

License

CC BY-NC-SA 4.0


Data Summary

Data format: Image
Data volume: --
File size: 24.84GB
Publisher: Tsinghua University - Tencent Joint Laboratory
Since its establishment in 2010, the joint lab has focused on three main directions: "scientific research cooperation", "personnel training" and "academic exchange".