Abstract
This paper presents a multi-engine recognition system for printed Chinese characters. The system gives priority to the layout analysis produced by the Hanwang optical character recognition (OCR) engine (HW-OCR). After the Hanwang and Tsinghua (TH-OCR) engines have each performed character recognition, their results are merged according to the image coordinates of each character. Color highlighting marks three kinds of characters: those on which the two OCR engines disagree, those with low confidence, and those flagged as errors by the WiseCheck semantic collation engine. The system improves on the working mode of existing large-scale digitization production lines, in which proofreaders compare the recognized text against the image character by character across the full text. It can reduce labor intensity, raise work efficiency, and lower processing cost.
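The coordinate-based merging described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Char` record, the `merge` function, the intersection-over-union matching rule, and the thresholds `min_iou` and `min_conf` are all assumptions introduced here for illustration, and the real HW-OCR, TH-OCR, and WiseCheck interfaces are not modeled. Each character from the primary engine is paired with the best-overlapping character from the secondary engine and then flagged as a conflict, as low confidence, or as accepted.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in image pixels


@dataclass
class Char:
    """One recognized character (hypothetical record, not a real engine type)."""
    text: str
    box: Box
    conf: float  # assumed to be normalized to the range 0.0 to 1.0


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge(primary: List[Char], secondary: List[Char],
          min_iou: float = 0.5, min_conf: float = 0.8) -> List[Tuple[str, str]]:
    """Return (text, flag) pairs following the primary engine's character order.

    flag is "conflict" when the engines disagree on an overlapping character,
    "low_confidence" when the primary confidence is below min_conf,
    and "ok" otherwise. Both thresholds are illustrative assumptions.
    """
    merged = []
    for p in primary:
        # Pick the secondary character whose box overlaps this one the most.
        match = max(secondary, key=lambda s: iou(p.box, s.box), default=None)
        if match is not None and iou(p.box, match.box) < min_iou:
            match = None  # overlap too small to count as the same character
        if match is not None and match.text != p.text:
            flag = "conflict"
        elif p.conf < min_conf:
            flag = "low_confidence"
        else:
            flag = "ok"
        merged.append((p.text, flag))
    return merged
```

In a proofreading interface, each flag could be mapped to a highlight color, with a further flag for characters reported by a semantic checker, so that the proofreader inspects only the marked characters instead of reading the full text against the image.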
Source
Journal of Guangxi Academy of Sciences (《广西科学院学报》)
2011, No. 4, pp. 317-319 (3 pages)
Keywords
Chinese character recognition
optical character recognition (OCR)
semantic collation
multi-engine