
Webpage Data Acquisition Algorithm and Its Application in Household Surveys
Abstract: Current web data acquisition techniques still struggle with dynamic pages that are hard to parse, slow crawling, and inaccurate content capture. To address these problems, this paper designs a multithreaded webpage data acquisition and analysis algorithm based on Selenium. The data acquisition part relies mainly on Selenium, a Python library for automating and controlling browsers, which resolves the problem of retrieving data from both dynamic and static pages. A headless browser, multithreaded crawling, and a keyword discrimination routine substantially improve crawling speed and the accuracy of the captured content. The algorithm is then applied to checking the accuracy of household wage income survey data in the context of targeted poverty alleviation. Finally, taking the talent market website of a certain region as an example, the paper captures real-time wage levels across industries and assesses the accuracy of the wage data in the household survey by comparing the survey data with the captured data.
Authors: Shen Chengfang; Mo Dalong; Huang Wentao (School of Mathematics and Computer Science, Hezhou University, Hezhou, Guangxi 542899, China; School of Mathematics and Statistics, Guangxi Normal University, Guilin, Guangxi 541004, China)
Source: Statistics & Decision (《统计与决策》; CSSCI; Peking University Core Journal), 2021, No. 7, pp. 52-56 (5 pages)
Funding: National Social Science Fund of China, Western Project (18XTJ002); Guangxi Normal University Innovation Program (XYCSZ2019088)
Keywords: webpage data acquisition algorithm; household survey; web crawler; multithreading; targeted poverty alleviation; Python; Selenium
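
The abstract names three ingredients: a headless browser driven by Selenium, multithreaded crawling, and a keyword check on the captured content. The following is a minimal sketch of how these pieces might fit together, not the authors' implementation. The function names (`fetch_with_selenium`, `matches_keywords`, `crawl`), the choice of Chrome, the thread count, and the "any keyword occurs in the page" rule are all assumptions made for illustration; the abstract does not specify them.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def fetch_with_selenium(url: str) -> str:
    """Render one page in a headless Chrome session and return its HTML.

    Assumption: Chrome and a matching driver are available locally; the
    abstract does not say which browser the authors used.
    """
    # Imported lazily so the rest of the sketch runs without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run the browser without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # the browser executes JavaScript, so dynamic content is rendered
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


def matches_keywords(text: str, keywords: Iterable[str]) -> bool:
    """Hypothetical keyword check: keep a page if any keyword occurs in it."""
    return any(keyword in text for keyword in keywords)


def crawl(
    urls: Iterable[str],
    keywords: List[str],
    fetch: Callable[[str], str] = fetch_with_selenium,
    max_workers: int = 4,
) -> Dict[str, str]:
    """Fetch URLs on a thread pool and keep only pages that match a keyword."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(fetch, url_list))  # results come back in input order
    return {
        url: html
        for url, html in zip(url_list, pages)
        if matches_keywords(html, keywords)
    }
```

In this sketch each call to `fetch_with_selenium` opens its own browser session, which avoids sharing one driver between threads at the cost of extra startup time. The `fetch` parameter can be replaced with a stub, so the threading and keyword logic can be exercised without a browser.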