
On Information Extraction Technique Based on DTA
Abstract: To address the shortcomings of existing information extraction techniques based on webpage structure, this paper proposes an information extraction technique based on the deterministic tree automaton (DTA). The core idea is to convert the HTML document into a binary tree and then extract data according to whether the tree automaton accepts or rejects the page to be extracted. The method makes full use of the tree structure of HTML documents and, by relying on the tree automaton, combines conventional structure-based information extraction, which follows a single route, with grammatical inference. Experimental results show that, compared with similar extraction methods, the approach improves precision, recall, and extraction time.
Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2009, No. 12, pp. 228-230, 250 (4 pages)
Keywords: tree automata; information extraction; HTML
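
The abstract describes two steps: converting an HTML document into a binary tree, and letting a deterministic tree automaton accept or reject it. The following Python sketch illustrates that general idea only. The abstract does not say which binary encoding or which automaton construction the paper uses, so the first-child/next-sibling encoding, the bottom-up evaluation, and every name here (`Node`, `TreeBuilder`, `to_binary`, `run`, `accepts`, `DELTA`, `FINAL`) are assumptions made for illustration. The transition table is written by hand, whereas the paper infers its automaton through grammatical inference.

```python
from html.parser import HTMLParser


class Node:
    """An element of the unranked HTML tree (text content is ignored)."""
    def __init__(self, label):
        self.label = label
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds an element tree; assumes every tag is explicitly closed."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()


def to_binary(node, siblings=()):
    """First-child/next-sibling encoding (an assumed choice):
    left subtree = first child, right subtree = next sibling."""
    left = to_binary(node.children[0], node.children[1:]) if node.children else None
    right = to_binary(siblings[0], siblings[1:]) if siblings else None
    return (node.label, left, right)


def html_to_binary(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return to_binary(builder.root)


EMPTY = "e"  # state assigned to a missing subtree


def run(tree, delta):
    """Evaluate the automaton bottom-up; None means no transition applies."""
    if tree is None:
        return EMPTY
    label, left, right = tree
    q_left, q_right = run(left, delta), run(right, delta)
    if q_left is None or q_right is None:
        return None
    # Deterministic: at most one target state per (label, left, right) key.
    return delta.get((label, q_left, q_right))


def accepts(tree, delta, final_states):
    return run(tree, delta) in final_states


# Hand-written transitions accepting a <ul> with one or more <li> children.
DELTA = {
    ("li", EMPTY, EMPTY): "items",
    ("li", EMPTY, "items"): "items",
    ("ul", "items", EMPTY): "list",
    ("#root", "list", EMPTY): "ok",
}
FINAL = {"ok"}
```

With these hypothetical transitions, `accepts(html_to_binary("<ul><li></li><li></li></ul>"), DELTA, FINAL)` is `True`, while a page such as `<ul><p></p></ul>` is rejected because no transition exists for `p`. An extraction system would additionally mark which node of an accepted tree holds the target data; that step is not shown here.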


