Abstract
In automatic document recognition (OCR) systems, a character segmentation algorithm that relies only on geometric measurements such as size and position easily produces missegmented image blocks, so feedback from the character classifier is needed to segment accurately. This paper therefore proposes a new rejection algorithm whose goal is to reject invalid characters as accurately as possible. The confidence and generalized confidence of distance-based classifiers are analyzed first, and on that basis the commonly used generalized confidence mapping function is modified. A sample-based rejection rule is also designed, which makes the rejection algorithm more adaptive and flexible. Experiments on Chinese, Japanese, and Korean document samples show that the algorithm clearly improves system performance and is of some general value for recognizing lower-quality printed text.
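The abstract does not give the mapping function or the rejection rule, so the following is only a minimal sketch of the general idea of confidence-based rejection for a distance-based classifier. The form `1 - d1/d2` for the generalized confidence, the error-count criterion for the threshold, and all function names (`generalized_confidence`, `learn_threshold`, `should_reject`) are assumptions made for illustration, not the paper's method.

```python
from typing import Sequence


def generalized_confidence(distances: Sequence[float]) -> float:
    """Map class distances to a score in [0, 1].

    Assumed form (not from the paper): 1 - d1/d2, where d1 <= d2 are the
    smallest and second-smallest distances to the class templates.
    """
    if len(distances) < 2:
        raise ValueError("need at least two class distances")
    d1, d2 = sorted(distances)[:2]
    if d2 <= 0.0:
        return 0.0  # both distances are zero: the top classes are indistinguishable
    return 1.0 - d1 / d2


def learn_threshold(valid_scores: Sequence[float],
                    invalid_scores: Sequence[float]) -> float:
    """Choose the threshold with the fewest errors on labeled samples.

    A block is rejected when its score is below the threshold. The
    criterion (false rejects + false accepts) is a hypothetical choice.
    """
    candidates = sorted(set(valid_scores) | set(invalid_scores))
    best_t, best_err = 0.0, float("inf")
    for t in candidates:
        false_reject = sum(s < t for s in valid_scores)
        false_accept = sum(s >= t for s in invalid_scores)
        err = false_reject + false_accept
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def should_reject(distances: Sequence[float], threshold: float) -> bool:
    """Reject an image block whose generalized confidence is too low."""
    return generalized_confidence(distances) < threshold
```

In a segmentation loop, a block for which `should_reject` returns true would be sent back to the segmenter for re-cutting rather than passed on as a recognized character.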
Source
《计算机工程与应用》
CSCD
Peking University Core Journals (北大核心)
2002, No. 17, pp. 69-72 (4 pages)
Computer Engineering and Applications
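Funding
National High Technology Research and Development Program of China (863 Program), Grant No. 2001AA114081
National Natural Science Foundation of China, Grant No. 69972024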