Ontology-Based Two-Phase Semi-Automatic Web Extracting (基于Ontology的Web内容二阶段半自动提取方法)

Cited by: 18

Abstract: The massive amount of information on the Web has become an important information source, and extracting information from large numbers of semi-structured or unstructured HTML pages has become an active research topic. However, Web pages are originally designed to be browsed by users rather than processed automatically by applications, so building an extraction system that is both precise and widely applicable faces many difficulties. Existing methods can be roughly divided into wrappers generated interactively and wrappers generated automatically; the former lack general applicability, while the latter lack extraction precision. This paper proposes a novel two-phase, semantics-based, semi-automatic extraction method. It reduces interactive work as much as possible while maintaining extraction precision, and the degree of automation in wrapper generation gradually increases as more websites are included. Compared with existing methods, the proposed method takes into account both the precision of the wrapper's extraction results and the general applicability of the extraction process. Its effectiveness has been validated in the authors' prototype system, which has successfully extracted 1.2 million HTML pages.

Authors: Gao Jun, Wang Tengjiao, Yang Dongqing, Tang Shiwei

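Affiliations: School of Information Science and Technology, Peking University; State Key Laboratory of Visual and Auditory Information Processing, Peking University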

Source: Chinese Journal of Computers (《计算机学报》), 2004, No. 3, pp. 310-318 (9 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.

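Funding: Supported by the National Key Basic Research Development Program of China ("973" Program, G1999032705) and the National High Technology Research and Development Program of China ("863" Program, 2002AA4Z3440).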

Keywords: Internet; search engine system; information acquisition; Web; Ontology; web page classification; semi-automatic extraction method; Database systems; Feature extraction; Hypertext systems; User interfaces

Classification: TP393.4 [Automation and Computer Technology: Computer Application Technology]
