Research on Semantic Recognition for Mathematical Formula in Digital Chinese Documents (original title: 中文电子文档中数学公式的语义识别方法研究)

Abstract: Mathematical formulas in Chinese electronic documents have complex structures and contain many special symbols, which makes them difficult for current OCR technology to recognize efficiently. To address this, a new semantic recognition method for formulas is proposed. First, formulas are located in two passes by combining the central moment of character width with a Chinese-character rejection method. Next, the formula characters are segmented using the projection method and the connected-component method; features such as the number of holes and crossing lines are extracted to build a character template library, and each character in the formula is recognized by template matching. Then, based on the characteristics of five classes of feature characters, seven character-block merging rules, including the post-script (后标型), containing (包含型), and independent (独立型) types, are established to analyze the formula structure and recover its grammatical meaning. Finally, the result of the structural analysis is output as an EQ field syntax string. Experimental results show that the proposed method can effectively perform semantic analysis of mathematical formulas in Chinese electronic documents.
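The abstract names projection-based segmentation and a hole-count feature but gives no implementation detail. The following is a minimal sketch of those two ideas only, assuming a binarized bitmap in which 1 is ink and 0 is background. The function names (`vertical_projection`, `split_columns`, `crop`, `count_holes`) are hypothetical and are not taken from the paper; the connected-component segmentation, crossing-line features, and template matching are not covered here.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # assumption: 1 = ink (foreground), 0 = background


def vertical_projection(bitmap: Bitmap) -> List[int]:
    """Count the ink pixels in each column."""
    if not bitmap:
        return []
    return [sum(row[x] for row in bitmap) for x in range(len(bitmap[0]))]


def split_columns(bitmap: Bitmap) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column spans separated by blank columns."""
    spans: List[Tuple[int, int]] = []
    start = None
    for x, count in enumerate(vertical_projection(bitmap)):
        if count > 0 and start is None:
            start = x                      # a run of inked columns begins
        elif count == 0 and start is not None:
            spans.append((start, x))       # a blank column ends the run
            start = None
    if start is not None:
        spans.append((start, len(bitmap[0])))
    return spans


def crop(bitmap: Bitmap, start: int, end: int) -> Bitmap:
    """Cut out the columns in [start, end)."""
    return [row[start:end] for row in bitmap]


def count_holes(bitmap: Bitmap) -> int:
    """Count enclosed background regions using a 4-connected flood fill."""
    if not bitmap:
        return 0
    h, w = len(bitmap), len(bitmap[0])
    # Pad with a background border so the outer background forms one region.
    padded = [[0] * (w + 2)] + [[0] + list(r) + [0] for r in bitmap] + [[0] * (w + 2)]
    seen = [[False] * (w + 2) for _ in range(h + 2)]
    regions = 0
    for sy in range(h + 2):
        for sx in range(w + 2):
            if padded[sy][sx] != 0 or seen[sy][sx]:
                continue
            regions += 1
            seen[sy][sx] = True
            stack = [(sy, sx)]
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < h + 2 and 0 <= nx < w + 2
                            and padded[ny][nx] == 0 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return regions - 1  # every region except the outer background is a hole


# A toy line containing a ring-shaped glyph and a vertical bar.
sample = [
    [1, 1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1],
]
glyph_spans = split_columns(sample)
hole_counts = [count_holes(crop(sample, a, b)) for a, b in glyph_spans]
```

On the toy bitmap, `glyph_spans` is `[(0, 3), (5, 6)]` and `hole_counts` is `[1, 0]`, which is the kind of coarse feature that could help separate glyphs before template matching. Splitting on blank columns alone cannot separate touching or vertically stacked symbols, which is presumably why the abstract pairs projection with connected-component analysis.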

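Authors: Wang Gao (王高), Wang Peizhen (王培珍), Du Peiming (杜培明), Wang Aifang (王爱芳), Zhang Ziqiang (张自强)

Affiliation: School of Electrical and Information Engineering, Anhui University of Technology (安徽工业大学电气与信息工程学院)

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list; 2017, No. 10, pp. 2379-2384 (6 pages)

Funding: Supported by the National Natural Science Foundation of China (51574004)

Keywords: mathematical formula; location; character recognition; structural analysis; semantic recognition

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]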
