
Autonomous Extraction Method for Web Information (cited by: 15)

Autonomous Extract Information from Web Pages
Abstract: This paper presents a method for autonomously extracting information from Web pages based on table structures and list structures. Guided by the user's information needs, the method autonomously extracts information from relevant pages, reorganizes it according to the relational model, and stores it in a database. For table-structured sources, annotating a single page is enough to obtain the extraction knowledge; through self-learning, the wrapper adapts well to dynamic changes in page content, so that extraction proceeds automatically. For list-structured sources, the wrapper analyzes the DOM tree to obtain dynamically the path of the information block within the DOM hierarchy, and then obtains the values of the information objects from their basic extraction knowledge. Self-learning is likewise used to adapt to dynamic changes in page content.
Authors: 许建潮 (Xu Jianchao), 侯锟 (Hou Kun)
Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2005, No. 14, pp. 185-189, 198 (6 pages)
Keywords: Web; semi-structured data; information extraction; wrapper
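
The list-structure step described in the abstract (locating the path of an information block in the DOM hierarchy, then reading object values from it) can be illustrated with a minimal sketch. This is not the authors' wrapper: the names `PathCollector` and `extract_list_block` are hypothetical, and the rule "the most frequently repeated text-bearing path is the list block" is an assumed heuristic standing in for the paper's extraction knowledge. Only the Python standard library is used.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Record the DOM path and text of every element that directly holds text."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open tags from the root down to the current node
        self.records = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag; tolerates unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.records.append(("/".join(self.stack), text))


def extract_list_block(html):
    """Return (path, values) for the most frequently repeated text-bearing path.

    Assumed heuristic: list items share one DOM path, so the path that
    occurs most often is treated as the information block.
    """
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    if not parser.records:
        return None, []
    counts = Counter(path for path, _ in parser.records)
    best_path, _ = counts.most_common(1)[0]
    values = [text for path, text in parser.records if path == best_path]
    return best_path, values


if __name__ == "__main__":
    page = ("<html><body><h1>News</h1>"
            "<ul><li>Alpha</li><li>Beta</li><li>Gamma</li></ul>"
            "<p>Footer</p></body></html>")
    print(extract_list_block(page))
```

Because the path is recomputed on each run rather than hard-coded, the sketch keeps working if the list moves elsewhere in the page, which is the kind of layout change the abstract says the method is meant to tolerate. Storing the returned values in a relational table would be a separate step.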
