
Adaptive Web Information Extraction Based on DOM Tree

Cited by: 16
Abstract: Web information extraction usually relies on wrapper induction, an inductive learning approach that learns extraction rules from a given set of training pages. Although this approach extracts information accurately, the rules must be learned again whenever the template of a Web site changes, so the resulting extractor is costly to maintain and adapts poorly. We propose a new adaptive Web information extraction method. It first uses clustering to obtain the keyword groups that frequently accompany merchandise in Web pages, and then uses the DOM tree structure of each page to locate the information block that contains these keywords, so that the information is extracted automatically. Experiments on a large number of commercial Web sites show that the algorithm extracts merchandise information effectively and does not depend on the structure of any particular site.
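Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2009, No. 7, pp. 202-203, 210 (3 pages)
Funding: Supported by the Natural Science Foundation of Guangdong Province (No. 07006474)
Keywords: DOM tree; information extraction; adaptive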
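
The block-location step described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not give the clustering procedure or the block-selection rule, so the keywords below are hard-coded stand-ins for the clustering output, and the rule "pick the deepest DOM node whose text contains every keyword" is an assumption. The names `Node`, `TreeBuilder`, `find_block`, `extract_block`, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A minimal DOM node: tag, attributes, direct text, and children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # text found directly inside this element

    def full_text(self):
        # Text of this node plus all descendants (order is not preserved,
        # which is fine for keyword containment checks).
        return " ".join([self.text] + [c.full_text() for c in self.children])


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def find_block(node, keywords):
    """Return the deepest node whose text contains every keyword, else None."""
    text = node.full_text().lower()
    if not all(k.lower() in text for k in keywords):
        return None
    for child in node.children:
        hit = find_block(child, keywords)
        if hit is not None:
            return hit
    return node


def extract_block(html, keywords):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return find_block(builder.root, keywords)


# Hypothetical page: only the "item" block contains both keywords.
PAGE = """
<html><body>
<div id="nav">Home | Price list | Contact</div>
<div id="item"><h2>Model X100</h2><p>Price: 99</p><p>Brand: Acme</p></div>
<div id="footer">Copyright</div>
</body></html>
"""

if __name__ == "__main__":
    block = extract_block(PAGE, ["price", "brand"])
    print(block.tag, block.attrs.get("id"))
```

Because the search depends only on where the keywords occur in the tree, not on fixed tag paths, the same call keeps working if the site wraps the item block in additional containers. That is the kind of template independence the abstract claims, though a real system would need to handle repeated blocks and noisy keyword matches.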
