Research on Automated Web Navigation and Data Integration Rules for Web Information Extraction

Original title: Web信息抽取网页自动浏览导航与集成规则研究

Cited by: 1

Abstract: The Web holds a large amount of valuable data, and Web information extraction techniques have been studied extensively over the past decade or more. However, most existing studies and systems concentrate on the data extraction stage applied to pages that have already been acquired, and they ignore or simplify the automated page navigation and data integration that a complete Web information extraction process requires. To address these shortcomings, this paper proposes a three-stage Web information extraction model comprising automated navigation, data extraction, and data integration. Based on this model, it further proposes an automated navigation model and designs and implements an automated navigation rule language. It also proposes an extraction-transformation-integration (ETI) model for Web data, together with a flexible data integration and workflow control rule language that maintains the complex relationships among data records spanning multiple pages and provides flexible workflow control. Results from extraction examples show that the proposed rule language and the implemented system can effectively carry out the full process of Web information extraction and integration.

Authors: Wang Haitao (王海涛), Zhang Zhiliang (张志亮), Sun Yuhua (孙煜华), Yuan Chunfeng (袁春风), Huang Yihua (黄宜华)

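Affiliations: Department of Computer Science and Technology, Nanjing University; State Key Laboratory for Novel Software Technology, Nanjing University; Information Center, Guangzhou Power Supply Bureau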

Source: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), CSCD, 2014, No. 9, pp. 1049-1066 (18 pages)

Funding: National Natural Science Foundation of China; Jiangsu Province Science and Technology Support Program

Keywords: Web information extraction; automated Web navigation; data integration; workflow control; rule language

Classification: TP317 [Automation and Computer Technology: Computer Software and Theory]
