
A Data Collection Method for the Deep Web
Original title: 一种深层网的数据采集方法
Cited by: 1
Abstract: To address problems that arise in web information collection, such as complex script parsing and asynchronous data interaction, this paper proposes a collection method based on a browser kernel, builds a web crawler system with the browser kernel at its core, and tests the system for both collection performance and collection feasibility. The browser kernel serves as the crawler's page-parsing engine: it executes the client-side scripts in a page and completes complex data interactions, so that URLs and other useful data hidden in the deep web can be extracted in full. As web applications develop, page structures will become more complex and collection will grow steadily harder for traditional web crawlers, whereas a crawler based on a browser kernel can adapt well to these changes.
Authors: CHEN Xin (陈新); DU Yuncheng (都云程); XIAO Shibin (肖诗斌) (Computer School, Beijing Information Science & Technology University, Beijing 100101, China; Beijing TRS Information Technology Co., Ltd, Beijing 100101, China)
Source: Journal of Beijing Information Science and Technology University (Natural Science Edition) (《北京信息科技大学学报(自然科学版)》), 2018, No. 5, pp. 60-64 (5 pages)
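Funding: 863 Program project "Intelligent Assessment of Knowledge and Ability and Human-like Question-Answering Verification System for Basic Education" (2015AA015409)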
Keywords: browser kernel; script parsing; web crawler; deep web
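
The abstract's central claim is that links created by client-side scripts are invisible to a crawler that parses only the raw page source, but become visible once a browser kernel has executed those scripts. The sketch below illustrates that difference under stated assumptions and is not the authors' system. The page content, `PAGE_URL`, `fetch_raw`, and `render_with_browser_kernel` are all hypothetical; in particular, `render_with_browser_kernel` is a stub that returns a hand-written post-script DOM, standing in for a real browser kernel.

```python
from html.parser import HTMLParser
from typing import Callable, List
from urllib.parse import urljoin

PAGE_URL = "http://example.com/index.html"  # hypothetical page address

# Hypothetical raw source: one static link, plus a script that adds a
# second link to the DOM only when it is executed.
RAW_HTML = """
<html><body>
<a href="/static.html">static</a>
<div id="list"></div>
<script>
  var a = document.createElement("a");
  a.href = "/item?id=42";
  document.getElementById("list").appendChild(a);
</script>
</body></html>
"""

# Hand-written DOM as it would look after the script above has run.
RENDERED_HTML = """
<html><body>
<a href="/static.html">static</a>
<div id="list"><a href="/item?id=42"></a></div>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def fetch_raw(url: str) -> str:
    """Stub for a traditional crawler: returns the page source as served."""
    return RAW_HTML


def render_with_browser_kernel(url: str) -> str:
    """Stub for a browser kernel: returns the DOM after scripts have run."""
    return RENDERED_HTML


def collect(url: str, load_page: Callable[[str], str]) -> List[str]:
    """Load a page with the given loader and extract its links."""
    parser = LinkExtractor(url)
    parser.feed(load_page(url))
    parser.close()
    return parser.links


static_links = collect(PAGE_URL, fetch_raw)
rendered_links = collect(PAGE_URL, render_with_browser_kernel)
print("raw source:", static_links)
print("after rendering:", rendered_links)
```

Passing the page loader as a parameter keeps the link extraction identical in both cases, so the only variable is whether scripts were executed before parsing. In a real deployment the stub would be replaced by a call into an actual browser engine.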