Abstract
This paper discusses the main problems addressed in designing a practical multi-font English character recognition system. The system recognizes as many as 260 fonts, including italic and bold faces. Its recognition rate on the training set reaches 99%, and its error rate on real-world documents is 56% lower than that of TH-OCR2000. The paper describes in detail the common techniques used for text line and character segmentation, feature extraction, classifier design, and post-processing, analyzes and compares their characteristics, and proposes some new techniques. It offers guidance for the design of OCR systems.
Source
Computer Engineering and Applications (《计算机工程与应用》)
Indexed in: CSCD, Peking University Core Journals
2001, Issue 20, pp. 120-122 (3 pages)
Funding
National 863 High-Tech Program (No. 863-306-ZT03-03-1)
National Natural Science Foundation of China (No. 69972024)
Keywords
multi-font printed English recognition system
OCR
character segmentation
feature extraction
classifier design
post-processing