A Method of Automatic Web Information Extraction Based on XML


Abstract: With the rapid growth of the Internet and the ever-increasing volume of Web data, users find it more and more difficult to obtain useful information from the Web. Extracting accurate information from the Web quickly and effectively has become an urgent problem, and Web information extraction technology has emerged in response. This paper proposes a new XML-based method for automatic Web information extraction. A data conversion algorithm standardizes HTML documents, the XPath expressions of sample instances are learned to form an extraction rule base, and the rule base is then applied to other pages of the same kind to extract information automatically. Experimental results show that the method achieves high recall and precision and that the extraction results are self-describing, which makes it convenient to build data extraction systems for different domains.
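The pipeline summarized in the abstract (learn an XPath expression from a labeled sample, store it as a rule, apply the rule to a page of the same kind) can be sketched roughly as follows. This is a minimal illustration, not the authors' algorithm: the abstract does not describe their data conversion or XPath learning steps, so the sketch assumes the pages have already been standardized into well-formed XML, and the helper names `learn_path` and `extract`, the sample pages, and the `record` output element are all hypothetical.

```python
import xml.etree.ElementTree as ET


def learn_path(root, sample_text):
    """Return a positional path to the first element whose text equals sample_text.

    Assumption: a "learned" rule is simply the tag/position path of the sample
    value; the paper's actual learning procedure is not given in the abstract.
    """
    def walk(node, path):
        if (node.text or "").strip() == sample_text:
            return path
        counts = {}  # per-tag sibling counter, used for 1-based positions
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            found = walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
            if found:
                return found
        return None

    return walk(root, ".")


def extract(root, rules):
    """Apply every rule in the rule base and wrap the values in named XML elements."""
    record = ET.Element("record")
    for field, path in rules.items():
        node = root.find(path)
        value = (node.text or "").strip() if node is not None else ""
        ET.SubElement(record, field).text = value
    return record


# Hypothetical pages, assumed to be already standardized (well-formed) markup.
SAMPLE = "<html><body><div><span>Widget A</span><span>19.90</span></div></body></html>"
OTHER = "<html><body><div><span>Widget B</span><span>5.00</span></div></body></html>"

sample_root = ET.fromstring(SAMPLE)
rules = {
    "name": learn_path(sample_root, "Widget A"),
    "price": learn_path(sample_root, "19.90"),
}
result = extract(ET.fromstring(OTHER), rules)
print(ET.tostring(result, encoding="unicode"))
```

Because each extracted value is wrapped in an element named after its field, the output is self-describing in the sense the abstract mentions. Real HTML is rarely well-formed, so a practical system would need a tidy-up step before parsing, which this sketch omits.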

Authors: Song Jie (宋洁), Zhang Na (张娜), Liu Yanliu (刘艳柳), Gu Junhua (顾军华)

Affiliation: School of Computer Science and Software, Hebei University of Technology

Source: Journal of Hebei University of Technology (《河北工业大学学报》), indexed in CAS and the Peking University Core Journals list, 2010, No. 5, pp. 73-77 (5 pages)

Funding: Tianjin Research Program of Application Foundation and Advanced Technology (10JCZDJC16000)

Keywords: XML; XPath learning; XSL; information extraction; DOM tree

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
