Reflowable document comprehension method based on fusion features and grammar rules (基于融合特征与语法规则的流式文档理解方法)
Abstract To address the insufficient analysis of component identification features in reflowable document structure understanding, this paper proposes a document comprehension method based on fusion features and grammar rules. The method uses two vectors: a format vector that represents format features of a component, such as its font, and a content vector that represents content features, such as keywords extracted from the component. For each component to be identified, the format and content features are scored separately against each candidate component, and the two scores are combined with different weights to produce candidate component labels. Finally, based on the candidate labels and the grammar rules, the logical structure of the document is recognized by combining top-down and bottom-up structure recognition. Experimental results show that the method effectively improves the accuracy of component identification and, in turn, the accuracy of whole-document structure recognition.
Authors HAO Haili (郝海利); LI Ning (李宁); TIAN Ying'ai (田英爱); GENG Si (耿思) (Computer School, Beijing Information Science & Technology University, Beijing 100101, China)
Source Journal of Beijing Information Science and Technology University (《北京信息科技大学学报(自然科学版)》, Natural Science Edition), 2019, No. 1, pp. 49-54 (6 pages)
Funding National Key R&D Program of China (2018YFB1004100); National Natural Science Foundation of China (61672105)
Keywords document structure comprehension; document identification; reflowable document
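
The weighted scoring step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the scoring function, the weights, or the feature layout, so the cosine similarity, the `w_format` weight, the three-element feature vectors, and the candidate labels below are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fused_score(component, candidate, w_format=0.5):
    """Weighted sum of the format score and the content score.

    w_format is a hypothetical weight; the content score gets 1 - w_format.
    """
    format_score = cosine(component["format"], candidate["format"])
    content_score = cosine(component["content"], candidate["content"])
    return w_format * format_score + (1.0 - w_format) * content_score


def rank_labels(component, candidates, w_format=0.5):
    """Return candidate labels ordered from best to worst fused score."""
    scored = [
        (fused_score(component, cand, w_format), label)
        for label, cand in candidates.items()
    ]
    return [label for _, label in sorted(scored, reverse=True)]


# Hypothetical feature layout (an assumption, not taken from the paper):
#   format  = [relative font size, bold, centered]
#   content = [contains a title keyword, starts with a section number, long text]
candidates = {
    "title":   {"format": [1.0, 1.0, 1.0], "content": [1.0, 0.0, 0.0]},
    "heading": {"format": [0.6, 1.0, 0.0], "content": [0.0, 1.0, 0.0]},
    "body":    {"format": [0.4, 0.0, 0.0], "content": [0.0, 0.0, 1.0]},
}

# A component to be identified: bold, medium-sized, numbered text.
component = {"format": [0.6, 1.0, 0.0], "content": [0.0, 1.0, 0.0]}

ranking = rank_labels(component, candidates)
print(ranking)
```

Raising `w_format` makes the ranking depend more on typography and lowering it favors textual cues. In the method described above, the ranked labels would then serve as candidates for the grammar-rule-based structure recognition, which this sketch does not cover.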