Online and offline Chinese handwriting databases
The online and offline Chinese handwriting databases, CASIA-OLHWDB and CASIA-HWDB, were built by the National Laboratory of Pattern Recognition (NLPR), Institute of Automation of Chinese Academy of Sciences (CASIA). The handwritten samples were produced by 1,020 writers using Anoto pens on paper, so that both online and offline data were obtained. The samples include both isolated characters and handwritten texts (continuous scripts). We collected data from writers from 2007 to 2010, and completed the segmentation and annotation in 2010. The databases include six datasets of online data and six datasets of offline data; in each case, three are for isolated characters (DB1.0–1.2) and three for handwritten texts (DB2.0–2.2). In both the online and offline cases, the datasets of isolated characters contain about 3.9 million samples of 7,356 classes (7,185 Chinese characters and 171 symbols), and the datasets of handwritten texts contain about 5,090 pages and 1.35 million character samples. All the data have been segmented and annotated at the character level, and each dataset is partitioned into standard training and test subsets.
Offline Touching Characters Dataset
For assessing touching character segmentation algorithms, we present a database of touching characters collected from the Chinese handwriting database CASIA-HWDB, called CASIA-HWDB-T. All the touching characters (or strings) are annotated with the character classes, locations of touching points, and auxiliary values like string height (LH) and average stroke width (SW).
According to different language types, we partition the touching strings into four subsets: 2,788 all-digit strings (HWDB-T-allDigits), 328 all-letter ones (HWDB-T-allLetters), 50,157 all-Chinese strings (HWDB-T-allChinese), and 3,196 mixed-character ones (HWDB-T-other).
According to the number of characters and touching points, we partition the dataset into three subsets: 48,536 single-touching pairs (HWDB-ST-P), 6,115 single-touching strings with more than two characters (HWDB-ST-M), and 1,818 multiple-touching pairs (HWDB-MT). More details about the dataset can be found in our paper listed below.
For our handwriting data collection, we compiled a character set based on
the standard sets GB2312-80 and Modern Chinese Character List of
Common Use (Common Set in brief). The GB2312-80 contains 6,763 Chinese
characters, including 3,755 in level-1 set and 3,008 in level-2 set.
The Common Set contains 7,000 Chinese characters. We took the union of
the two sets, containing 7,170 characters, to support recognition of
practical documents. We further added 15 Chinese characters that we
encountered in practice, giving 7,185 Chinese characters. We also
collected a set of 171 symbols, including 52 English letters, 10
digits, and some frequently used punctuation marks and mathematical and
physical symbols. The total number of character classes is thus 7,356.
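The counts quoted above can be checked with a short script. This is only a sketch: it assumes that Python's built-in gb2312 codec follows GB2312-80 and that, in its two-byte encoding, the level-1 characters use lead bytes 0xB0 to 0xD7 and the level-2 characters use lead bytes 0xD8 to 0xF7. The helper name count_gb2312_hanzi is hypothetical. The Common Set is not enumerated here, so the union size of 7,170 is taken from the text rather than recomputed.

```python
def count_gb2312_hanzi(first_lead, last_lead):
    """Count two-byte codes in a lead-byte range that decode under gb2312."""
    count = 0
    for lead in range(first_lead, last_lead + 1):
        for trail in range(0xA1, 0xFF):  # trail bytes 0xA1..0xFE
            try:
                bytes([lead, trail]).decode("gb2312")
            except UnicodeDecodeError:
                continue  # unassigned code point, skip it
            count += 1
    return count


# Assumed lead-byte ranges for the two levels of GB2312-80.
level1 = count_gb2312_hanzi(0xB0, 0xD7)
level2 = count_gb2312_hanzi(0xD8, 0xF7)

# Figures taken from the text above, not recomputed.
UNION_SIZE = 7170      # GB2312-80 united with the Common Set
EXTRA_CHINESE = 15     # characters added by the authors
SYMBOLS = 171          # letters, digits, punctuation, other symbols

chinese_classes = UNION_SIZE + EXTRA_CHINESE
total_classes = chinese_classes + SYMBOLS

print(level1, level2, level1 + level2)
print(chinese_classes, total_classes)
```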
For collecting handwritten texts, we asked each writer to hand-copy five texts. We compiled three sets of texts (referred to as versions V1–V3), mostly downloaded from news Web pages, except for five texts of ancient Chinese poems in each of V1 and V2. Each set contains 50 texts, each containing 150–370 characters. The three sets were used in different stages of handwriting data collection. The texts in each set were further divided into 10 subsets (referred to as templates T1–T10), each containing five texts to be written by one writer.
We collected handwriting data in three stages using three sets (versions) of templates. Each set has 10 templates to be written by 10 writers. A template has 13–15 pages of isolated characters and five pages of texts. For a template set, the isolated characters are divided into three groups: symbols, frequent Chinese characters and low-frequency Chinese characters. The symbols are always on the first page, followed by Chinese characters. The first six templates of a set print the same group of frequent Chinese characters in six different orders by rotating six equal parts, and the last four templates print the low-frequency Chinese characters in four different orders. Rotation guarantees that each character is written equally often in different time intervals, which balances the writing quality. In addition, each template has five pages of different texts. The three sets (versions) of templates are summarized in Table I. V1 and V3 have the same set of isolated characters. The number of isolated Chinese characters in V1 and V3 is actually 7,184, not 7,185, because the templates of V1 were designed earliest. The templates of V3 inherited the isolated character set of V1 and updated the texts. The frequent Chinese character set of V1 and V3 is actually the level-1 set of GB2312-80, which was commonly taken as a standard set in Chinese character recognition research.
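The rotation of parts described above can be illustrated as follows. This is a minimal sketch, not the procedure actually used to lay out the templates: the function name rotated_orders is hypothetical, and the choice to let parts differ in size by one character when the set does not divide evenly is an assumption.

```python
def rotated_orders(chars, n_parts=6):
    """Return n_parts orderings of chars, each starting at a different part."""
    base, extra = divmod(len(chars), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` parts take one more item (assumed tie-breaking).
        end = start + base + (1 if i < extra else 0)
        parts.append(chars[start:end])
        start = end

    orders = []
    for shift in range(n_parts):
        rotated = parts[shift:] + parts[:shift]
        orders.append([c for part in rotated for c in part])
    return orders


# Toy example: 12 "characters" in six parts of two.
demo = rotated_orders(list("ABCDEFGHIJKL"), n_parts=6)
for order in demo:
    print("".join(order))
```

In this toy example every part occupies each of the six positions exactly once across the six orderings, which is the property that spreads each character over different time intervals of a writing session.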
The whole dataset
The distribution of templates in (either online or offline) datasets DB1.0-1.2 is shown in Table V, and DB2.0-2.2 have the same partitioning. Compared with the isolated character datasets, the handwritten text dataset OLHWDB2.2 is missing one training writer of template V2-T9, and HWDB2.0 is missing one test writer of template V2-T3. In all databases, the ratio of training writers to test writers is 4:1.
To enable the evaluation of machine learning and classification
algorithms on standard feature data, we provide the feature data of
offline handwriting datasets HWDB1.0 and HWDB1.1, online handwriting
datasets OLHWDB1.0 and OLHWDB1.1. The samples fall in 3,755 classes of
Chinese characters in the GB2312-80 level-1 set. The datasets HWDB1.1 and
OLHWDB1.1 (300 writers) are recommended for preliminary
experiments on Chinese character recognition with the standard category set.
The datasets HWDB1.0 and OLHWDB1.0 (420 writers) can be added to HWDB1.1
and OLHWDB1.1 to enlarge the training set.
The feature extraction methods are specified in the reference below, and the results reported there can be used for fair comparison:
C.-L. Liu, F. Yin, D.-H. Wang, Q.-F. Wang, Online and Offline Handwritten Chinese Character Recognition: Benchmarking on New Databases, Pattern Recognition, 46(1): 155-162, 2013.
For offline characters, the extracted feature is the 8-direction histogram of the normalization-cooperated gradient feature (NCGF), combined with the pseudo 2D normalization method of line density projection interpolation. The resulting feature is 512D.
For online characters, the feature extracted is the 8-direction histogram of original trajectory direction combined with pseudo 2D bi-moment normalization. The resulting feature is 512D.
The feature data of each dataset is partitioned into two subsets for training and testing, respectively. The numbers of writers and samples of the files are shown in the Table below.
The format of the feature data files is described below.
Character Sample Data
We provide the isolated character datasets HWDB1.0-1.2 (offline) and OLHWDB1.0-1.2 (online) for the study of isolated character recognition and the pre-training of classifiers for text line recognition. Each dataset is partitioned into a standard training set and a test set of disjoint writers. The format descriptions of offline characters (.gnt) and online characters (.pot) can be found at the bottom.
We provide the offline text line data (stored in DGRL files, each page contains multiple lines) of HWDB2.0-2.2 and online text line data (stored in WPTT files) of OLHWDB2.0-2.2 for the study of text line recognition. The statistics of text lines of each dataset can be found at the bottom. Each dataset is partitioned into sets of pages (text lines) for training and testing.
Competition Test Data
Based on the CASIA-HWDB and CASIA-OLHWDB databases, we organized Chinese Handwriting Recognition competitions in 2010, 2011 and 2013. The competition test data are now open for research. There are four datasets generated by 60 writers: offline character data, online character data, offline text data, and online text data. The data format specifications can be found below.
Touching Character Dataset
There are three datasets of isolated characters in the offline handwriting
database. The statistics of these datasets are shown in Table 1. The
datasets include 1,020 files, and each file (.gnt) stores concatenated
gray-scale character images of one writer. The file format of (.gnt) is
specified in Table 2.
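A reader for such a file might look like the sketch below. Table 2 is not reproduced in this text, so the record layout used here is an assumption that should be checked against the table: a 4-byte little-endian sample size, a 2-byte tag code (GB), a 2-byte width, a 2-byte height, and then width × height gray-level bytes. The function name read_gnt_samples is hypothetical, and the demonstration uses a synthetic in-memory record rather than a real data file.

```python
import io
import struct

HEADER = struct.Struct("<I2sHH")  # assumed layout: size, tag, width, height


def read_gnt_samples(stream):
    """Yield (tag_bytes, width, height, bitmap) from a binary stream."""
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return  # clean end of file
        if len(header) < HEADER.size:
            raise ValueError("truncated sample header")
        sample_size, tag, width, height = HEADER.unpack(header)
        # Sanity check under the assumed layout: size covers header + bitmap.
        if sample_size != HEADER.size + width * height:
            raise ValueError("sample size does not match image dimensions")
        bitmap = stream.read(width * height)
        if len(bitmap) != width * height:
            raise ValueError("truncated bitmap")
        yield tag, width, height, bitmap


# Synthetic record: a 3x2 image labelled with one GB2312 character.
pixels = bytes([255, 0, 0, 255, 255, 255])
record = HEADER.pack(HEADER.size + len(pixels), "啊".encode("gb2312"), 3, 2) + pixels
samples = list(read_gnt_samples(io.BytesIO(record * 2)))

for tag, width, height, bitmap in samples:
    print(tag.decode("gb2312"), width, height, len(bitmap))
```

For a real file, the stream would come from opening one writer's .gnt file in binary mode. How single-byte symbol codes are padded inside the 2-byte tag field is not covered by this sketch.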
HWDB1.0 includes 3,866 Chinese characters and 171 alphanumeric characters and symbols. Among the 3,866 Chinese characters, 3,740 characters are in the GB2312-80 level-1 set (which contains 3,755 characters in total). HWDB1.1 includes 3,755 GB2312-80 level-1 Chinese characters and 171 alphanumeric characters and symbols. HWDB1.2 includes 3,319 Chinese characters and 171 alphanumeric characters and symbols. The set of Chinese characters in HWDB1.2 (3,319 classes) is disjoint from that of HWDB1.0. HWDB1.0 and HWDB1.2 together include 7,185 Chinese characters (7,185 = 3,866 + 3,319), which cover all 6,763 Chinese characters in GB2312.
The offline text databases were produced by the same writers as the
isolated character datasets. Each person wrote five pages of given
texts. One writer (no.371) and four pages are missing because of data
loss. Each page is stored in a .dgrl file named after the writer index and
page number. In addition to the gray-scale image, the data file also
includes ground truths of text line segmentation and character class
labels (in GB codes). The statistics of the datasets and the format of the
.dgrl file are shown in Table 3 and Table 4, respectively.
A DGRL (.dgrl) file stores one page of a document image. The background is eliminated (encoded as 255) and the foreground (text strokes) is encoded in gray levels 0-254, one byte per pixel. Each page is stored as a series of lines. Each line has a header giving the number of characters, the sequence of character codes (GBK), the top-left position, and the line height and width, followed by the bitmap block (height × width bytes).
When concatenating the lines into a page image, note that the bounding boxes of different lines may overlap, because the text strokes of adjacent lines can overlap along the vertical axis. To restore the page image, the foreground pixels of the different lines should therefore be combined.
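One way to combine overlapping lines is to start from a blank (255) page and keep the darker value wherever two line bitmaps cover the same pixel, so that the background of one line never erases the strokes of another. The sketch below assumes the lines have already been parsed; the dictionary keys (top, left, height, width, bitmap), the function name compose_page, and the explicit page size arguments are hypothetical and not part of the file format.

```python
def compose_page(lines, page_width, page_height):
    """Paste line bitmaps onto a white page, keeping the darkest pixel."""
    page = bytearray([255]) * (page_width * page_height)
    for line in lines:
        for r in range(line["height"]):
            for c in range(line["width"]):
                y, x = line["top"] + r, line["left"] + c
                if not (0 <= y < page_height and 0 <= x < page_width):
                    continue  # ignore pixels that fall outside the page
                value = line["bitmap"][r * line["width"] + c]
                idx = y * page_width + x
                # Foreground is 0-254 and background is 255, so taking the
                # minimum keeps strokes from every overlapping line.
                if value < page[idx]:
                    page[idx] = value
    return bytes(page)


# Two 2x2 lines whose boxes overlap at page position (row 1, column 1).
line_a = {"top": 0, "left": 0, "height": 2, "width": 2,
          "bitmap": bytes([0, 255, 255, 10])}
line_b = {"top": 1, "left": 1, "height": 2, "width": 2,
          "bitmap": bytes([255, 20, 30, 255])}
page = compose_page([line_a, line_b], page_width=3, page_height=3)
print(list(page))
```

In the example, the stroke pixel of the first line at the overlapping position survives even though the second line stores background there.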
There are three datasets of isolated characters in the online database.
The statistics of these datasets are shown in Table 1. The datasets
include 1,020 files, and each file (.pot) stores character samples written
by one person. The file format of (.pot) is specified in Table 2.
OLHWDB1.0 includes 3,866 Chinese characters and 171 alphanumeric characters and symbols. Among the 3,866 Chinese characters, 3,740 characters are in the GB2312-80 level-1 set (which contains 3,755 characters in total). OLHWDB1.1 includes 3,755 GB2312-80 level-1 Chinese characters and 171 alphanumeric characters and symbols. OLHWDB1.2 includes 3,319 Chinese characters and 171 alphanumeric characters and symbols. The set of Chinese characters in OLHWDB1.2 (3,319 classes) is disjoint from that of OLHWDB1.0.
OLHWDB1.0 and OLHWDB1.2 together include 7,185 Chinese characters (7,185 = 3,866 + 3,319), which cover all 6,763 Chinese characters in GB2312.
The online handwritten text datasets were produced by the same writers as
the isolated character datasets. Each person wrote five pages of given
texts. One writer (no.671) and three pages (2 pages of no.328 and 1 page
of no.685) are missing because of data loss. Each page is stored in a
.wptt file named after the writer index and page number. In addition to
the stroke trajectory data of the page, the data file also includes
ground truths of text line segmentation and character class
labels (text line transcript in GB codes). The statistics of the
datasets and the format of (.wptt) file are shown in Table 3 and Table