Abstract
To address problems that arise in web information collection, such as complex script parsing and asynchronous data interaction, a collection method based on a browser kernel is proposed. A web crawler system built around the browser kernel is constructed, and the system is tested for both collection performance and collection feasibility. The browser kernel serves as the crawler's page-parsing engine: it executes the client-side scripts in a web page and completes complex data interactions, so that useful data hidden in the deep web, such as URLs, can be extracted in full. As web applications develop, page structures will grow more complex and collection will become steadily harder for traditional web crawlers, whereas a crawler based on a browser kernel can adapt well to these changes.
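The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the authors' system: the browser kernel is abstracted here as a hypothetical `load_page` callable that returns a page's HTML, and the two loaders below are hand-written stand-ins. `static_loader` plays the role of a traditional crawler that sees only the raw markup, while `rendered_loader` plays the role of a browser kernel that has already run the page's script. The names `collect_urls`, `BASE_URL`, and the sample HTML are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from typing import Callable, List
from urllib.parse import urljoin

BASE_URL = "http://example.com/list"  # hypothetical page address

# Raw markup as a traditional crawler would download it: one static link,
# plus a script that (by assumption) would add a second link at runtime.
RAW_HTML = (
    '<html><body><a href="/about">About</a>'
    "<script>/* adds a link to /items?page=2 at runtime */</script>"
    "</body></html>"
)

# Hand-written stand-in for the DOM after a browser kernel ran the script.
RENDERED_HTML = (
    '<html><body><a href="/about">About</a>'
    '<a href="/items?page=2">Next</a></body></html>'
)


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def static_loader(url: str) -> str:
    """Stand-in for a plain HTTP fetch: returns the unexecuted markup."""
    return RAW_HTML


def rendered_loader(url: str) -> str:
    """Stand-in for a browser kernel: returns markup after script execution."""
    return RENDERED_HTML


def collect_urls(url: str, load_page: Callable[[str], str]) -> List[str]:
    """Load a page with the given loader and return its absolute link URLs."""
    parser = LinkExtractor()
    parser.feed(load_page(url))
    return [urljoin(url, href) for href in parser.links]


if __name__ == "__main__":
    print(collect_urls(BASE_URL, static_loader))    # only the static link
    print(collect_urls(BASE_URL, rendered_loader))  # static and script-added links
```

In a real deployment the `rendered_loader` stand-in would be replaced by a call into an actual browser engine; keeping the loader as a parameter leaves the link-extraction logic unchanged either way.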
Authors
陈新
都云程
肖诗斌
CHEN Xin; DU Yuncheng; XIAO Shibin (Computer School, Beijing Information Science & Technology University, Beijing 100101, China; Beijing TRS Information Technology Co., Ltd, Beijing 100101, China)
Source
《北京信息科技大学学报(自然科学版)》
2018, No. 5, pp. 60-64 (5 pages)
Journal of Beijing Information Science and Technology University
Funding
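863 Program project "Intelligent Assessment of Knowledge and Ability and Human-like Question-Answering Verification System for Basic Education" (2015AA015409)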