摘要
当前,已经有大量为单一字符集(或语种)而设计的OCR(optical character recognition)分类器.同时,随着全球一体化,多语文档的出现越来越普遍.因此,设计多语文档处理系统势在必行.提出了一般性的解决方案:两项OCR技术、一个系统和语言判断.为了使研究工作具体化,实现了一个中英文混合文章处理系统.其中主要涉及了3个关键问题:系统流程控制、汉英语言区域分离和英文字符切分.与以往的系统相比,该系统增加了汉英语言区域分离模块,并将基于等间距性的新方法应用于该模块.为了验证本系统的有效性,综合以往的方法实现了另一个系统.实验结果表明,该系统的性能明显优于另一个系统,在杂志样和书籍样上的识别率分别从98.48%和98.68%提高到99.13%和99.25%.
Currently, OCR (optical character recognition) classifiers are generally designed for one character set (or language). On the other hand, multilingual document increases drastically due to the globalization. Therefore, designing a document processing system with multilingual capability is very important. A general scheme is presented in this paper: two OCR techniques, a system, and a language classification. For embodying the scheme, a Chinese/English mixed document processing system is implemented. Three key problems are considered: the control of the system flow, the classification of Chinese/English regions, and the segmentation of English characters. Compared with old systems presented in other papers, the module of the classification of Chinese/English regions is added in the system, and a novel approach based on the equidistance is applied to the module. To verify the effectiveness of the system, another system is implemented according to the methods presented in other papers. Experiment shows, the new system is more effective than the old system. The recognition rate increases from 98.48% to 99.13% on magazine samples and from 98.68% to 99.25% on book samples, respectively.
出处
《软件学报》
EI
CSCD
北大核心
2005年第5期786-798,共13页
Journal of Software
基金
国家自然科学基金天元基金~~