Abstract
A public, authoritative Mongolian handwriting database is the foundation for research on and development of Mongolian handwriting recognition systems. Building on a study of Mongolian encoding, word formation, and grammar, this paper releases MHW, a large-vocabulary offline handwritten Mongolian database. The training set covers 5,000 words with 20 samples collected per word, for 100,000 samples in total; test set I contains 5,000 samples and test set II contains 14,085 samples. An automatic error detection algorithm that exploits the variable length of Mongolian words is used to improve the reliability of the database. The database is evaluated on three popular handwriting recognition models, among which the recurrent neural network based model performs best, reaching a word error rate of 2.20% on test set I and 5.55% on test set II under a constrained dictionary.
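The reported figures are word error rates under a constrained dictionary. The following is a minimal sketch of how such a rate might be computed for isolated-word samples; it is not the authors' evaluation code. The function names, the score dictionary, and the Latin placeholder strings standing in for Mongolian words are all assumptions made for illustration.

```python
def best_in_lexicon(candidate_scores, lexicon):
    """Return the highest-scoring candidate word that is in the lexicon.

    candidate_scores: dict mapping candidate word -> recognizer score
    (hypothetical format). Returns None if no candidate is in the lexicon.
    """
    allowed = {w: s for w, s in candidate_scores.items() if w in lexicon}
    if not allowed:
        return None
    return max(allowed, key=allowed.get)


def word_error_rate(predictions, labels):
    """Fraction of samples whose predicted word differs from the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    errors = sum(p != t for p, t in zip(predictions, labels))
    return errors / len(labels)


# Toy data: placeholder strings, not real MHW samples.
lexicon = {"abd", "xyz", "mno", "qrs"}
recognizer_outputs = [
    {"abc": 0.9, "abd": 0.8},   # top guess "abc" is not in the lexicon
    {"xyz": 0.7, "mno": 0.2},
    {"mno": 0.6, "qrs": 0.5},
    {"qrs": 0.4, "abd": 0.3},
]
labels = ["abd", "xyz", "qrs", "qrs"]

predictions = [best_in_lexicon(s, lexicon) for s in recognizer_outputs]
wer = word_error_rate(predictions, labels)
```

Restricting the output to lexicon entries is one simple reading of "constrained dictionary"; the paper may apply the constraint differently, for example during decoding.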
Authors
FAN Daoerji (范道尔吉); GAO Guanglai (高光来); WU Huijuan (武慧娟)
College of Computer Science, Inner Mongolia University, Hohhot, Inner Mongolia 010021, China; College of Electronic Information Engineering, Inner Mongolia University, Hohhot, Inner Mongolia 010021, China
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journals (北大核心)
2018, Issue 1, pp. 89-95 (7 pages)
Funding
Natural Science Foundation of Inner Mongolia Autonomous Region (2016MS0603)