
Information Extraction from Web Page Based on Extended DOM Tree (基于扩展DOM树的Web页面信息抽取)

Cited by: 12
Abstract: As the Internet has grown, both the amount and the density of information on Web pages have increased. Most Web pages contain several information blocks that are laid out compactly and follow similar patterns in their HTML markup. This paper proposes an information extraction method for Web pages that contain multiple information blocks. First, an extended DOM (Document Object Model) tree is built and the page is broken into discrete information strips. Next, the discrete strips are regrouped into information blocks according to the hierarchy of the extended DOM tree, combined with the necessary visual features and semantic information. Finally, the subtrees that contain information blocks are identified, and the information is extracted by a depth-first traversal of the DOM tree. The algorithm is applicable to Web pages that contain multiple information blocks.
Source: Computer Applications and Software (《计算机应用与软件》), indexed in CSCD and the Peking University Core Journals list, 2007, No. 6, pp. 137-139 (3 pages)
Keywords: DOM tree; information extraction; wrapper; semi-structured
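
The abstract names three steps: build an annotated DOM tree, locate the subtree whose children are the information blocks, and collect each block by depth-first traversal. These steps could be sketched roughly as follows. This is only an illustration under strong assumptions, not the paper's algorithm: the record gives no implementation details, so the `Node` class, its `depth` field, the `signature` comparison (exact equality of tag structure standing in for "similar pattern in HTML syntax"), and all function names are hypothetical. The visual and semantic cues mentioned in the abstract are not modeled at all.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical 'extended' DOM node: tag, text, children, plus a depth field."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""
        self.depth = 0 if parent is None else parent.depth + 1

    def signature(self):
        """Nested tuple describing the tag structure of this subtree."""
        return (self.tag, tuple(c.signature() for c in self.children))


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text += data.strip()


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_block_parent(node, min_blocks=2):
    """Depth-first search for the first node whose children all share one signature."""
    sigs = [c.signature() for c in node.children]
    if len(sigs) >= min_blocks and len(set(sigs)) == 1:
        return node
    for child in node.children:
        found = find_block_parent(child, min_blocks)
        if found is not None:
            return found
    return None


def collect_text(node):
    """Depth-first collection of the text strips under one subtree."""
    strips = [node.text] if node.text else []
    for child in node.children:
        strips.extend(collect_text(child))
    return strips


def extract_blocks(html):
    """Return one list of text strips per detected information block."""
    parent = find_block_parent(build_tree(html))
    if parent is None:
        return []
    return [collect_text(child) for child in parent.children]


page = (
    "<html><body><h1>News</h1><ul>"
    "<li><b>Title A</b><span>2007-01-01</span></li>"
    "<li><b>Title B</b><span>2007-02-01</span></li>"
    "</ul></body></html>"
)
print(extract_blocks(page))
# [['Title A', '2007-01-01'], ['Title B', '2007-02-01']]
```

Exact signature equality is deliberately strict; real pages usually have blocks that differ slightly, which is presumably where the visual and semantic information described in the abstract would come in.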

References (8)

  • 1 Ashish, Knoblock. Wrapper Generation for Semi-structured Internet Sources [J]. SIGMOD Record, 1997, 26(4): 8-15.
  • 2 Line Eikvil. Information Extraction from World Wide Web - A Survey [M]. Report No. 945, Norwegian Computing Center, ISBN 82-539-0429-0, July 1999.
  • 3 Bouras C, Kapoulas V, Misedakis I. A Web-page Fragmentation Technique for Personalized Browsing [C]. ACM SAC 2004, March 14-17, 2004.
  • 4 Arnaud Sahuguet, Fabien Azavant. Building Light-weight Wrappers for Legacy Web Data-sources Using W4F [C]. International Conference on Very Large Databases, Edinburgh, Scotland, 1999: 738-741.
  • 5 Cai Deng, Yu Shipeng, Wen Jirong, Ma Weiying. VIPS: a Vision-based Page Segmentation Algorithm [R]. Technical Report MSR-TR-2003-79, November 2003.
  • 6 Li Xiaodong, Gu Yuqing. DOM-based Web information extraction [J]. Chinese Journal of Computers, 2002, 25(5): 526-533. Cited by: 101
  • 7 Joachim Hammer, Hector Garcia-Molina, Junghoo Cho. Extracting Semi-structured Information from the Web [C]. Proceedings of the First Workshop on Management of Semi-structured Data, Tucson, Arizona, 1997: 18-25.
  • 8 Zhang Shuyu, Zhu Zhongying. Research on Web information extraction based on MT decision trees [J]. Computer Engineering and Applications, 2004, 40(13): 69-71. Cited by: 4


