Research and Practice of Network Data Acquisition Based on Big Data (cited by: 1)
Abstract: Set in the Weibo big data environment, with public opinion data collection and user behavior analysis as the application background, this paper proposes the design and implementation of a crawler-based data acquisition system. The scheme combines a focused crawler with an incremental crawler and uses a crawling strategy based on content evaluation: it searches for the keywords supplied by the user and updates the relevant content when changes occur, so that the collected data stays timely and useful. In actual collection runs, a single machine gathered about 880,000 records per day. In practice, users can set the crawl rate to suit their needs and can raise both the volume and the speed of collection by adding more distributed crawlers.
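The abstract names three ideas without showing how they fit together: focused crawling, incremental crawling, and a content-evaluation strategy. The sketch below is one possible reading, not the authors' implementation. Every name in it (`relevance`, `fingerprint`, `crawl`, `PAGES`, the `threshold` value) is hypothetical. The keyword-ratio score and the SHA-256 change check are assumptions that stand in for whatever scoring and update detection the paper actually uses. An in-memory dictionary replaces real Weibo requests so the example runs offline.

```python
import hashlib
import heapq


def relevance(text, keywords):
    """Content-evaluation score: share of words that are keywords (assumed metric)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in keywords) / len(words)


def fingerprint(text):
    """Stable digest of the page text, used to detect changes between passes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def crawl(fetch, seeds, keywords, seen, threshold=0.1):
    """Run one crawl pass and return the URLs whose content is new or changed.

    fetch(url) must return (text, links). `seen` maps url -> digest and is
    carried over between passes; that carry-over is the incremental part.
    """
    keywords = {k.lower() for k in keywords}
    # Max-priority frontier via negated scores; seeds get the top priority.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    visited, updated = set(), []
    while frontier:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text, links = fetch(url)
        score = relevance(text, keywords)
        if score < threshold:
            continue  # focused: skip off-topic pages and do not follow their links
        digest = fingerprint(text)
        if seen.get(url) != digest:  # incremental: record only new or changed pages
            seen[url] = digest
            updated.append(url)
        for link in links:
            if link not in visited:
                # Links inherit the score of the page they were found on.
                heapq.heappush(frontier, (-score, link))
    return updated


# Hypothetical stand-in for the network: url -> (text, outgoing links).
PAGES = {
    "a": ("weibo public opinion data", ["b", "c"]),
    "b": ("cooking recipes and travel", ["d"]),
    "c": ("opinion trends on weibo today", []),
    "d": ("weibo opinion", []),
}


def fetch(url):
    return PAGES[url]


seen = {}
first = crawl(fetch, ["a"], ["weibo", "opinion"], seen)
second = crawl(fetch, ["a"], ["weibo", "opinion"], seen)
PAGES["c"] = ("weibo opinion shifted overnight", [])
third = crawl(fetch, ["a"], ["weibo", "opinion"], seen)
print(first, second, third)  # ['a', 'c'] [] ['c']
```

The first pass stores the two on-topic pages it can reach, the second finds nothing to update, and the third picks up only the edited page. Page `d` is never fetched because it is linked only from the off-topic page `b`, which is the trade-off a focused crawler accepts. Whitespace tokenization is a simplification; Chinese text would need a word segmenter first.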
Authors: HUO Ying; LI Xiaofan; QIU Zhimin; LI Yanting (School of Information Engineering, Shaoguan University, Shaoguan 512005, China; School of Intelligent Engineering, Shaoguan University, Shaoguan 512005, China)
Source: Software Engineering, 2023, No. 4, pp. 28-32 (5 pages)
Funding: Guangdong Provincial Philosophy and Social Sciences Planning Discipline Co-construction Project (GD18XXW07); Natural Science Foundation of Guangdong Province (2021A1515011803).
Keywords: big data; data acquisition; network crawler