
Research on Distributed Parallel Incremental Crawler Technology Based on Hadoop (cited by: 4)
Abstract: With online video in multimedia social networks growing explosively, a crawler running in single-machine mode extracts new video pages inefficiently. This paper therefore proposes a parallel algorithm based on Map/Reduce that greatly improves crawler efficiency. To further reduce data redundancy and cut down on the refreshing of outdated pages, it also improves an accuracy-aware incremental update algorithm: monitoring techniques track web page changes, page update patterns are analyzed, freshness evaluation and dimensionality reduction are added, and mixed integer quadratic programming (MIQP) is used to derive an optimal refresh strategy for pages that have changed. Experiments show that, compared with a periodic, frequent refresh strategy in single-machine mode, the parallel incremental method achieves 79% information accuracy at 36.7% of the original refresh cost, and crawler efficiency improves 167-fold.
Authors: Liu Fangyun; Zhang Zhiyong; Li Yuxiang (College of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China)
Source: Computer Measurement & Control, 2018, No. 10, pp. 269-275, 308 (8 pages)
Funding: National Natural Science Foundation of China (61772174, 61370220); Henan Province Science and Technology Innovation Outstanding Talents Program (174200510011); Henan Province University Science and Technology Innovation Team Support Program (15IRTSTHN010)
Keywords: Hadoop cluster; distributed crawler; parallel crawler; incremental crawler; refresh strategy
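
The abstract combines two ideas: splitting crawl work into Map/Reduce steps, and refreshing only pages that have actually changed. The sketch below is one possible illustration of that combination in plain Python, not the paper's algorithm. It assumes change detection by content hash, and every function name (`map_page`, `shuffle`, `reduce_page`, `incremental_pass`) is hypothetical. It does not model the freshness evaluation, dimensionality reduction, or MIQP refresh scheduling that the abstract mentions, and it runs in a single process rather than on a Hadoop cluster.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_page(url: str, body: str) -> Tuple[str, str]:
    """Map step: emit (url, content fingerprint) for one fetched page."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return url, digest


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group fingerprints by URL, standing in for the framework's shuffle phase."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for url, digest in pairs:
        grouped[url].append(digest)
    return grouped


def reduce_page(url: str, digests: List[str], previous: Dict[str, str]) -> Tuple[str, str]:
    """Reduce step: classify a URL against the previous crawl snapshot."""
    latest = digests[-1]  # assumption: the last fetch of a URL is the current one
    if url not in previous:
        return url, "new"
    return url, "changed" if previous[url] != latest else "unchanged"


def incremental_pass(fetched: Iterable[Tuple[str, str]],
                     previous: Dict[str, str]) -> Dict[str, str]:
    """Run map, shuffle, and reduce in-process over (url, body) pairs."""
    pairs = [map_page(url, body) for url, body in fetched]
    grouped = shuffle(pairs)
    return dict(reduce_page(url, digests, previous) for url, digests in grouped.items())
```

In a real deployment the map and reduce functions would run as distributed tasks, and only the URLs classified as new or changed would be handed to whatever refresh scheduler is in use.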