Abstract
To locate mathematical expressions (MEs) in scanned Chinese documents that mix Chinese and English text, an ME identification method based on Chinese character recognition and ME symbol recognition is proposed. The method consists of three parts: Chinese character extraction, embedded ME extraction, and isolated ME identification. In Chinese character extraction, three kinds of information are first collected for each character block: the Chinese character recognition result, the ME symbol recognition result, and the block's geometric features. A decision tree then separates Chinese characters from non-Chinese characters. In embedded ME extraction, embedded MEs are located among the non-Chinese character blocks using the semantic information of ME symbols, the script (superscript and subscript) relations between adjacent symbols, and the semantic information of the MEs. In isolated ME identification, layout structure features are extracted from text lines that contain many embedded ME symbols and no Chinese characters, and a Gaussian mixture model distinguishes isolated MEs from ordinary text lines. On a test set of 148 document images containing 3,690 MEs, the method achieves an ME identification accuracy of 91.19%.
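The final stage described in the abstract, separating isolated MEs from ordinary text lines with a Gaussian mixture model, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the use of one mixture per class compared by likelihood, the single scalar layout feature, and all mixture parameters are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def gmm_likelihood(x, components):
    """Likelihood of x under a mixture given as (weight, mean, variance) tuples."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)


# Hypothetical mixtures over one made-up layout feature in [0, 1]
# (for example, how strongly a line is centered on the page).
# Real parameters would be estimated from labeled training lines.
ISOLATED_ME_GMM = [(0.6, 0.80, 0.01), (0.4, 0.60, 0.02)]
TEXT_LINE_GMM = [(0.7, 0.10, 0.01), (0.3, 0.30, 0.02)]


def classify_line(feature):
    """Label a candidate line by comparing the two class likelihoods."""
    p_me = gmm_likelihood(feature, ISOLATED_ME_GMM)
    p_text = gmm_likelihood(feature, TEXT_LINE_GMM)
    return "isolated_me" if p_me > p_text else "text_line"
```

In this sketch only lines that already passed the earlier filters (many embedded ME symbols, no Chinese blocks) would be given to `classify_line`; the paper's actual feature set and model configuration are not stated in the abstract.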
Source
《中文信息学报》
CSCD
PKU Core (Peking University Core Journals)
2008, No. 4, pp. 83-87 (5 pages)
Journal of Chinese Information Processing
Funding
Supported by the National High Technology Research and Development Program of China (863 Program), grant 2006AA01Z153
Keywords
artificial intelligence
pattern recognition
Chinese document
character recognition
mathematical expression
Gaussian mixture model