Abstract
Extracting and recognizing mathematical expressions from scientific literature in PostScript format is a new research direction in mathematical expression recognition. This paper presents a content-based method for extracting mathematical expressions from PostScript (PS) documents generated by Microsoft Word or LaTeX. First, several relevant PS commands are redefined to extract the character and line-segment information in a PS document. Next, the characters are analyzed by name, font, and position, while the line segments are connected and recognized, which yields the mathematical symbols. Finally, the symbols are merged according to their spatial relationships and heuristic rules to produce the final expressions. Experiments show that the method achieves an accuracy of 98.56%.
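The final merging step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Symbol` record, the `merge_symbols` function, the `max_gap` threshold, and the rule itself (group symbols that overlap vertically and sit within a small horizontal gap) are all hypothetical stand-ins for the spatial relationships and heuristic rules mentioned in the abstract, which are not specified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Symbol:
    """Hypothetical record for one extracted symbol and its bounding box."""
    text: str
    x0: float  # left edge
    y0: float  # bottom edge
    x1: float  # right edge
    y1: float  # top edge


def vertical_overlap(a: Symbol, b: Symbol) -> bool:
    """True if the two bounding boxes share some vertical extent."""
    return min(a.y1, b.y1) > max(a.y0, b.y0)


def merge_symbols(symbols: List[Symbol], max_gap: float = 3.0) -> List[List[Symbol]]:
    """Greedily group symbols, left to right, into candidate expressions.

    Assumed heuristic (not taken from the paper): a symbol joins an existing
    group if it overlaps vertically with the group's last symbol and starts
    no more than `max_gap` units to the right of it.
    """
    groups: List[List[Symbol]] = []
    for sym in sorted(symbols, key=lambda s: s.x0):
        for group in groups:
            last = group[-1]
            if sym.x0 - last.x1 <= max_gap and vertical_overlap(last, sym):
                group.append(sym)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append([sym])
    return groups


if __name__ == "__main__":
    page = [
        Symbol("x", 0, 0, 5, 10),
        Symbol("+", 6, 2, 10, 8),
        Symbol("y", 11, 0, 16, 10),
        Symbol("z", 40, 0, 45, 10),   # same line, but far to the right
        Symbol("w", 0, 50, 5, 60),    # different line
    ]
    for group in merge_symbols(page):
        print("".join(s.text for s in group))
```

A real system would presumably also need rules for superscripts, subscripts, fraction bars, and radicals, where symbols belonging to the same expression do not overlap vertically.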
Source
《计算机应用与软件》
CSCD
Peking University Core Journal (北大核心)
2008, No. 11, pp. 157-159, 162 (4 pages in total)
Computer Applications and Software