An Identification Method for Mathematical Expressions in Scanned Chinese Document

Original title: 一种中文文档的数学公式定位方法


Abstract: To locate mathematical expressions (MEs) in scanned Chinese documents that mix Chinese and English text, this paper proposes an ME identification method based on Chinese character recognition and ME symbol recognition. The method has three stages: Chinese character extraction, embedded ME extraction, and isolated ME identification. In the first stage, three kinds of information are extracted for each character block (the Chinese character recognition result, the ME symbol recognition result, and the block's geometric features), and a decision tree separates Chinese characters from non-Chinese characters. In the second stage, embedded MEs are located among the non-Chinese blocks using the semantic information of ME symbols, the script (superscript and subscript) relations between adjacent symbols, and the semantic and syntactic information of MEs. In the third stage, layout features are extracted from text lines that contain many embedded ME symbols and no Chinese characters, and a Gaussian mixture model distinguishes isolated MEs from ordinary text lines. On a test set of 148 document images containing 3,690 MEs, the method achieves an ME identification accuracy of 91.19%.
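The abstract names a Gaussian mixture model for the isolated-ME stage but gives neither the layout features nor the model parameters. The following is a minimal sketch of how a candidate text line could be labeled by comparing its likelihood under two mixtures. Everything specific in it is an assumption made for illustration: the two features (relative indentation and relative line height), the hand-set diagonal-covariance parameters, and the use of one mixture per class.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, components):
    """Log-likelihood of x under a mixture of (weight, mean, var) components."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Hypothetical feature vector: (indentation / page width, line height / median height).
# Hypothetical, hand-set parameters; a real system would fit them to labeled
# training lines, for example with the EM algorithm.
ISOLATED_ME = [
    (0.6, (0.30, 1.6), (0.01, 0.09)),
    (0.4, (0.20, 1.1), (0.01, 0.04)),
]
PLAIN_TEXT = [
    (0.7, (0.02, 1.0), (0.001, 0.01)),
    (0.3, (0.08, 1.0), (0.002, 0.01)),
]


def is_isolated_me(features):
    """Label a candidate line as an isolated ME if that mixture explains it better."""
    return gmm_log_likelihood(features, ISOLATED_ME) > gmm_log_likelihood(
        features, PLAIN_TEXT
    )
```

Under these assumed parameters, a deeply indented, tall line such as `(0.28, 1.5)` is labeled as an isolated ME, while a flush-left line of normal height such as `(0.02, 1.0)` is labeled as ordinary text. Only lines that already passed the earlier stages (many embedded ME symbols, no Chinese characters) would reach this check.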

Authors: Guo Yusheng (郭育生), Tan Nutao (谭怒涛), Huang Lei (黄磊), Liu Changping (刘昌平)

Affiliation: Institute of Automation, Chinese Academy of Sciences

Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2008, No. 4, pp. 83-87 (5 pages)

Funding: National High-Tech Research and Development Program of China (863 Program), grant 2006AA01Z153

Keywords: artificial intelligence; pattern recognition; Chinese document; character recognition; mathematical expression; Gaussian mixture model

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]



