Abstract
In Chinese document recognition with contextual post-processing, improvement in the document recognition rate is severely limited when a candidate set does not contain the correct character. Based on a noisy channel model of the isolated character recognizer, this paper proposes a method for expanding the candidate sets: the candidates returned by the recognizer are used to conjecture the characters most likely to be correct, and these are combined with the original candidates to form new candidate sets. In a test on 300 sets of off-line handwritten samples, the average top-50 error rate of the new candidate sets was 37.88% lower than that of the original candidate sets. In post-processing of off-line handwritten Chinese documents (about 80,000 characters) with a character-based bigram language model, the average recognition rate rose from 93.93% without candidate set expansion to 95.82% with it, a 31.14% reduction in error rate.
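The 31.14% figure is a relative reduction in error rate, not a difference in recognition rates. A minimal sketch of that arithmetic follows; the function name `relative_error_reduction` is a hypothetical helper introduced here for illustration, and only the two recognition rates are taken from the abstract.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Relative drop in error rate, in percent, given two accuracies in percent."""
    err_before = 100.0 - acc_before  # error rate before candidate set expansion
    err_after = 100.0 - acc_after    # error rate after candidate set expansion
    return 100.0 * (err_before - err_after) / err_before


# Recognition rates reported in the abstract (character-based bigram model).
reduction = relative_error_reduction(93.93, 95.82)
print(f"{reduction:.2f}%")
```

The error rate falls from 6.07% to 4.18%, and the ratio of that 1.89-point drop to the original 6.07% gives the reported reduction.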
Source
《清华大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core (Peking University core journal list)
2001, No. 1, pp. 24-28 (5 pages)
Journal of Tsinghua University (Science and Technology)
Funding
National "863" High-Tech Program of China (863-306-03-05-6)
National Natural Science Foundation of China (69682003)
Keywords
Chinese character recognition
post-processing
language model
candidate set expansion
noisy channel model
combination
communication systems