Abstract
Microblogs are becoming a major social medium for disseminating public information, and acquiring microblog data efficiently is important for the analysis of online public opinion. Taking Sina Weibo as the research object, this paper studies three data collection approaches, based on the microblog API, simulated login, and constructed visitor cookies, and proposes a microblog data collection scheme that fuses multiple strategies. For the simulated-login approach, an adaptive concurrent collection algorithm is designed and implemented, making data collection comparatively stable and efficient. For the visitor-cookie approach, a high-availability proxy pool module is designed and implemented, further improving collection efficiency. Experimental results show that the scheme combining the simulated-login-based adaptive concurrent collection strategy with constructed visitor cookies acquires microblog data efficiently, comprehensively, and stably.
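The abstract does not spell out how the adaptive concurrent collection algorithm adjusts its concurrency. As a rough illustration of the general idea only, the sketch below uses a generic additive-increase, multiplicative-decrease rule, which is an assumption and not the authors' published method. The class name `AdaptiveConcurrency`, its fields, and the 10% failure threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveConcurrency:
    """Hypothetical AIMD-style controller: grow slowly on success, shrink fast on failure."""

    level: int = 4          # current number of concurrent workers (assumed starting value)
    min_level: int = 1      # never drop below one worker
    max_level: int = 32     # assumed upper bound on concurrency
    threshold: float = 0.1  # tolerated failure ratio per batch (assumed)

    def update(self, successes: int, failures: int) -> int:
        """Adjust the concurrency level from the outcome of the last batch of requests."""
        total = successes + failures
        if total == 0:
            # No requests were made, so there is nothing to learn from.
            return self.level
        if failures / total > self.threshold:
            # Too many failures, possibly throttling or blocking: halve the concurrency.
            self.level = max(self.min_level, self.level // 2)
        else:
            # The batch was healthy: cautiously add one more worker.
            self.level = min(self.max_level, self.level + 1)
        return self.level
```

A crawler loop would call `update` after each batch and size its next worker pool to the returned level, so that the request rate backs off quickly when the site starts rejecting requests and recovers gradually afterwards.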
Authors
王培名
陈兴蜀
王海舟
王文贤
WANG Pei-ming; CHEN Xing-shu; WANG Hai-zhou; WANG Wen-xian (College of Computer Science, Sichuan University, Chengdu 610065, Sichuan, China; College of Cybersecurity, Sichuan University, Chengdu 610065, Sichuan, China; Cybersecurity Research Institute, Sichuan University, Chengdu 610065, Sichuan, China)
Source
《山东大学学报(理学版)》
CAS
CSCD
PKU Core (Peking University Core Journals)
2019, Issue 5, pp. 28-36, 43 (10 pages in total)
Journal of Shandong University(Natural Science)
Funding
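National Natural Science Foundation of China (61802270, 61802271)
National "Mass Entrepreneurship and Innovation" Demonstration Base, International R&D and Transformation Platform for Transformative Technologies (C700011)
Sichuan Province Key Research and Development Program (2018G20100)
Sichuan Province Science and Technology Support Program (2016GZ0038)
Fundamental Research Funds for the Central Universities (2017SCU11065)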