
Research on an XML-Based Method for Automatic Web Information Extraction

A Method of Web Information Automatic Extraction Based on XML
Abstract: The rapid growth of the Internet and the ever-increasing volume of Web data make it increasingly difficult for users to obtain useful information from the Web. Extracting information from the Web quickly and accurately has become a pressing problem, and Web information extraction technology has emerged in response. This paper proposes a new XML-based method for automatic Web information extraction. A data conversion algorithm normalizes HTML documents; the XPath expressions of sample instances are learned to form an extraction rule base; and the rule base is then used to extract information automatically from other pages of the same kind. Experimental results show that the method achieves high recall and precision, and that the extraction results are self-describing, which makes it convenient to build data extraction systems for different domains.
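The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea (learn XPath expressions from a labeled sample page, store them as a rule base, apply them to a page of the same kind). It assumes the pages have already been normalized into well-formed XML, matches sample values by exact text, and uses purely positional paths. The function names `learn_xpath`, `build_rule_base`, and `extract`, and the sample pages, are hypothetical and not taken from the paper.

```python
import xml.etree.ElementTree as ET


def learn_xpath(root, sample_text):
    """Return a positional path from root to the first element whose
    stripped text equals sample_text, or None if there is no such element."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    target = next(
        (el for el in root.iter() if (el.text or "").strip() == sample_text),
        None,
    )
    if target is None:
        return None
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        # XPath positions are 1-based among siblings with the same tag.
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "./" + "/".join(reversed(steps)) if steps else "."


def build_rule_base(sample_xml, labeled_fields):
    """Map each field name to the path learned from the labeled sample page."""
    root = ET.fromstring(sample_xml)
    return {name: learn_xpath(root, text) for name, text in labeled_fields.items()}


def extract(page_xml, rules):
    """Apply the rule base to another page with the same layout."""
    root = ET.fromstring(page_xml)
    record = {}
    for name, path in rules.items():
        node = root.find(path) if path else None
        record[name] = (node.text or "").strip() if node is not None else None
    return record


# Hypothetical, already-normalized pages that share one layout.
SAMPLE = """<html><body>
<div><h1>Python Basics</h1><span>Alice</span><span>29.00</span></div>
</body></html>"""

OTHER = """<html><body>
<div><h1>XML in Practice</h1><span>Bob</span><span>35.50</span></div>
</body></html>"""

rules = build_rule_base(
    SAMPLE, {"title": "Python Basics", "author": "Alice", "price": "29.00"}
)
result = extract(OTHER, rules)
print(rules)
print(result)
```

Because each extracted value is keyed by its field name, the result is self-describing in the sense the abstract mentions. Purely positional paths break as soon as a page's layout shifts, so a real rule base would likely need more robust expressions than this sketch uses.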
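Source: Journal of Hebei University of Technology (《河北工业大学学报》; indexed in CAS and the Peking University Core Journals list), 2010, No. 5, pp. 73-77 (5 pages)
Funding: Tianjin Research Program of Application Foundation and Advanced Technology (10JCZDJC16000)
Keywords: XML; XPath learning; XSL; information extraction; DOM tree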
