Abstract
Mathematical formulas in Chinese electronic documents have complex structures and contain many special symbols, which makes them difficult for current OCR technology to recognize efficiently. To address this, a new formula semantic recognition method is proposed. First, formulas are located in two passes by combining the central moment of character width with a Chinese-character rejection method. Next, formula characters are segmented using the projection method and the connected-component method; features such as hole count and crossing lines are extracted to build a character template library, and each character in the formula is recognized by template matching. Then, based on the characteristics of five classes of feature characters, seven character-block merging rules (including post-marked, containing, and independent types) are established to analyze the formula structure and recover its grammatical meaning. Finally, the structural analysis result is output as an EQ field syntax string. Experimental results show that the proposed method can effectively perform semantic analysis of mathematical formulas in Chinese electronic documents.
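The abstract names projection-based segmentation and hole count as a character feature without giving details. The following is only a minimal sketch of those two generic techniques, not the paper's implementation. It assumes a binarized image stored as a list of rows with 1 for ink and 0 for background, and the helper names (`vertical_projection_segments`, `crop`, `count_holes`) are hypothetical.

```python
from collections import deque


def vertical_projection_segments(img):
    """Return (start, end) column ranges of ink separated by blank columns.

    Assumption: img is a list of equal-length rows, 1 = ink, 0 = background.
    """
    if not img:
        return []
    width = len(img[0])
    # Vertical projection profile: amount of ink in each column.
    profile = [sum(row[x] for row in img) for x in range(width)]
    segments, start = [], None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # a run of ink columns begins
        elif count == 0 and start is not None:
            segments.append((start, x))    # the run ends at a blank column
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


def crop(img, x0, x1):
    """Cut the column range [x0, x1) out of every row."""
    return [row[x0:x1] for row in img]


def count_holes(glyph):
    """Count background regions fully enclosed by ink (4-connected background)."""
    h, w = len(glyph), len(glyph[0])
    # Pad with a background border so all outside background is one region.
    padded = [[0] * (w + 2)] + [[0] + list(r) + [0] for r in glyph] + [[0] * (w + 2)]
    seen = [[False] * (w + 2) for _ in range(h + 2)]

    def flood(y, x):
        queue = deque([(y, x)])
        seen[y][x] = True
        while queue:
            cy, cx = queue.popleft()
            for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                if 0 <= ny < h + 2 and 0 <= nx < w + 2:
                    if padded[ny][nx] == 0 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))

    flood(0, 0)  # mark the outside background
    holes = 0
    for y in range(h + 2):
        for x in range(w + 2):
            if padded[y][x] == 0 and not seen[y][x]:
                holes += 1  # an unvisited background region is enclosed by ink
                flood(y, x)
    return holes
```

A plain vertical projection cannot separate stacked or touching symbols such as fractions, subscripts, and radicals, which is presumably why the method pairs it with connected-component analysis.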
Source
《小型微型计算机系统》
CSCD
PKU Core Journal (北大核心)
2017, No. 10, pp. 2379-2384 (6 pages)
Journal of Chinese Computer Systems
Funding
Supported by the National Natural Science Foundation of China (51574004)
Keywords
mathematical formula location
character recognition
structural analysis
semantic recognition