
Research on an Automatic Wrapper Generation Method Based on Suffix Trees (cited by: 2)
Abstract A wrapper is a program that automatically extracts data from Web pages and converts it into a structured format. Most existing wrapper generation systems are semi-automated: they require the user to have prior knowledge of the target pages, and most handle only plain-structured data objects with a fixed number of attributes and values, not nested-structured data objects whose attributes may take a variable number of values. This paper proposes a suffix-tree-based method for generating wrappers automatically. The generated wrappers handle data objects with both plain and nested structures, and the method has low time complexity and some practical value.
Source Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2007, No. 34, pp. 114-118 (5 pages)
Funding National Natural Science Foundation of China (Grant No. 60473042).
Keywords Web page; information extraction; suffix tree; semi-structured data; automatic wrapper generation
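The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea behind suffix-based wrapper generation, not the authors' method. It assumes a page is reduced to a sequence of tag tokens with text abstracted to `TEXT`, and that the record template is the longest token pattern occurring at least `min_repeats` times without overlap. For brevity it uses a naive uncompressed suffix trie (quadratic in time and space) in place of a linear-time suffix tree, and it handles only flat records, not nested ones. All names (`TagTokenizer`, `build_suffix_trie`, `extract_records`, `PAGE`) are hypothetical.

```python
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Flatten HTML into tag tokens; non-blank text becomes the token 'TEXT'."""

    def __init__(self):
        super().__init__()
        self.tokens = []   # abstracted token sequence
        self.values = []   # parallel list: the text for 'TEXT' tokens, else None

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")
        self.values.append(None)

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")
        self.values.append(None)

    def handle_data(self, data):
        if data.strip():
            self.tokens.append("TEXT")
            self.values.append(data.strip())


def build_suffix_trie(tokens):
    """Naive suffix trie: each node lists the start positions of suffixes through it."""
    root = {"children": {}, "starts": []}
    for start in range(len(tokens)):
        node = root
        for tok in tokens[start:]:
            node = node["children"].setdefault(tok, {"children": {}, "starts": []})
            node["starts"].append(start)  # starts stay in ascending order
    return root


def count_non_overlapping(starts, length):
    """Greedy count of occurrences that do not overlap each other."""
    count, next_free = 0, 0
    for s in starts:
        if s >= next_free:
            count += 1
            next_free = s + length
    return count


def longest_repeated_pattern(tokens, min_repeats):
    """Longest token pattern with at least min_repeats non-overlapping occurrences."""
    best = []
    stack = [(build_suffix_trie(tokens), [])]
    while stack:
        node, path = stack.pop()
        for tok, child in node["children"].items():
            new_path = path + [tok]
            # Longer patterns can only occur less often, so pruning here is safe.
            if count_non_overlapping(child["starts"], len(new_path)) >= min_repeats:
                if len(new_path) > len(best):
                    best = new_path
                stack.append((child, new_path))
    return best


def extract_records(html, min_repeats=3):
    """Infer a flat record template and return (template, list of text fields per record)."""
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    tokens, values = parser.tokens, parser.values
    pattern = longest_repeated_pattern(tokens, min_repeats)
    records, i, n = [], 0, len(pattern)
    while n and i + n <= len(tokens):
        if tokens[i:i + n] == pattern:
            records.append([v for v in values[i:i + n] if v is not None])
            i += n
        else:
            i += 1
    return pattern, records


# Hypothetical sample page with three flat records.
PAGE = ("<ul><li><b>Apple</b><i>1.20</i></li>"
        "<li><b>Pear</b><i>0.80</i></li>"
        "<li><b>Plum</b><i>2.50</i></li></ul>")

if __name__ == "__main__":
    template, rows = extract_records(PAGE, min_repeats=3)
    print(template)
    print(rows)
```

On the sample page the inferred template is the eight-token `<li> ... </li>` record. When several equally long patterns qualify (for example, rotations of the same record when `min_repeats` is low), the sketch simply keeps whichever it finds first; a practical system would need a rule for choosing among them.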

References (13)

  • 1. Georg Gottlob, Christoph Koch. Monadic datalog and the expressive power of languages for Web information extraction[J]. Journal of the ACM, 2004, 51(1): 74-113.
  • 2. Chang C H, Hsu C N, Lui Shao-cheng. Automatic information extraction from semi-structured Web pages by pattern discovery[J]. Decision Support Systems, 2003, 35(4): 129-147.
  • 3. Buttler D, Liu Ling, Pu C. A fully automated object extraction system for the World Wide Web[C]//Proceedings of the 2001 International Conference on Distributed Computing Systems, 2001: 361-370.
  • 4. Li Baoli, Chen Yuzhong, Yu Shiwen. A survey of information extraction research[J]. Computer Engineering and Applications, 2003, 39(10): 1-5. (cited by: 178)
  • 5. Huang Yuqing, Qi Guangzhi, Zhang Fuyan. Constructing extractors for semi-structured information from Web documents[J]. Journal of Software, 2000, 11(1): 73-78. (cited by: 47)
  • 6. Muslea I, Minton S, Knoblock C A. Hierarchical wrapper induction for semistructured information sources[J]. Autonomous Agents and Multi-Agent Systems, 2001, 4(1/2): 93-114.
  • 7. Kushmerick N. Wrapper induction: efficiency and expressiveness[J]. Artificial Intelligence, 2000, 118(1/2): 15-68.
  • 8. Meng X F, Wang H Y, Hu D D, et al. Schema guided wrapper maintenance: a demonstration[C]//Proceedings of ICDE2003, 2003: 750-752.
  • 9. Grossi R, Italiano G F. Suffix trees and their applications in string algorithms[C]//Proc 1st South American Workshop on String Processing, 1993: 57-76.
  • 10. Weiner P. Linear pattern matching algorithms[C]//Proc 14th IEEE Symposium on Switching and Automata Theory, 1973: 1-11.


