Abstract
Current web data acquisition techniques still suffer from difficulty parsing dynamic web pages, slow crawling, and inaccurate content capture. To address these problems, this paper designs a multi-threaded web data collection and analysis algorithm based on Selenium. The data collection part mainly uses Selenium, a Python library for automating and controlling browsers, which solves the problem of retrieving data from both dynamic and static pages. The use of a headless browser, multi-threaded crawling, and a keyword discrimination program considerably improves crawling speed and the accuracy of the captured content. The algorithm is then applied to assessing the accuracy of household wage income survey data in the context of targeted poverty alleviation. Finally, taking the talent market website of a certain region as an example, real-time wage data for various industries are scraped, and the accuracy of the wage data in the household survey is assessed by comparing the survey data with the scraped data.
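The pipeline described in the abstract (headless browser, multi-threaded crawling, keyword discrimination) might be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the names `fetch_with_selenium`, `is_relevant`, `crawl`, and the `KEYWORDS` list are hypothetical, and the Selenium part assumes that the `selenium` package and a Chrome driver are installed.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical keyword list; the paper's actual keywords are not given in the abstract.
KEYWORDS = ("salary", "wage", "工资", "薪资")


def fetch_with_selenium(url):
    """Render a page in headless Chrome and return its HTML.

    Assumes selenium and a Chrome driver are available. The import is done
    lazily so the rest of the sketch can run without selenium installed.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run the browser without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # JavaScript runs, so dynamic content is rendered
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


def is_relevant(text, keywords=KEYWORDS):
    """Keyword discrimination: keep a page only if it mentions a keyword."""
    return any(keyword in text for keyword in keywords)


def crawl(urls, fetch=fetch_with_selenium, max_workers=4):
    """Fetch pages in parallel threads and keep only the relevant ones."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(fetch, urls))  # results keep the order of urls
    return {url: html for url, html in zip(urls, pages) if is_relevant(html)}
```

Passing the fetch function as a parameter keeps the threading and filtering logic independent of the browser, so it can be exercised with a stub fetcher. Each call to `fetch_with_selenium` starts its own browser instance, which is simple but costly; reusing one driver per thread would be a natural refinement.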
Authors
Shen Chengfang (沈承放)
Mo Dalong (莫达隆)
Huang Wentao (黄文韬)
Shen Chengfang; Mo Dalong; Huang Wentao (School of Mathematics and Computer Science, Hezhou University, Hezhou, Guangxi 542899, China; School of Mathematics and Statistics, Guangxi Normal University, Guilin, Guangxi 541004, China)
Source
Statistics & Decision (《统计与决策》)
CSSCI
Peking University Core Journals (北大核心)
2021, Issue 7, pp. 52-56 (5 pages)
Funding
National Social Science Fund of China, Western Project (18XTJ002)
Guangxi Normal University Innovation Program Project (XYCSZ2019088)