Abstract
The explosive growth of online video in multimedia social networks makes single-machine crawlers inefficient at extracting new video pages. To address this, a parallel crawling algorithm based on Map/Reduce is proposed, which greatly improves crawler efficiency. To further reduce data redundancy and the refreshing of outdated pages, an improved accuracy-aware incremental update algorithm is presented: monitoring techniques track web page changes, page update patterns are analyzed, freshness evaluation and dimensionality reduction are added, and mixed integer quadratic programming (MIQP) is used to derive an optimal refresh strategy for the pages that have changed. Experiments show that, compared with a periodic, frequent refresh strategy in single-machine mode, the parallel incremental method achieves 79% information accuracy at 36.7% of the original refresh cost, and crawler efficiency improves 167-fold.
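The idea of refreshing only the pages most likely to have changed, within a limited refresh cost, can be illustrated with a minimal sketch. This is not the paper's MIQP formulation: it swaps in a simple greedy heuristic, and it assumes a Poisson change model for freshness. The function names, the tuple layout, and the sample pages are all hypothetical.

```python
import math

def change_probability(rate_per_hour, hours_since_crawl):
    # Assumed Poisson model: probability that a page has changed at least
    # once since it was last crawled.
    return 1.0 - math.exp(-rate_per_hour * hours_since_crawl)

def pick_refreshes(pages, budget):
    # pages: list of (url, change rate per hour, hours since crawl, refresh cost).
    # Greedy stand-in for an optimizer: rank pages by expected staleness per
    # unit cost and take them while the budget allows.
    ranked = sorted(
        pages,
        key=lambda p: change_probability(p[1], p[2]) / p[3],
        reverse=True,
    )
    chosen, spent = [], 0.0
    for url, _rate, _hours, cost in ranked:
        if spent + cost <= budget:
            chosen.append(url)
            spent += cost
    return chosen

# Hypothetical pages; the rates and costs are illustrative, not measured.
pages = [
    ("video/a", 0.50, 6.0, 1.0),
    ("video/b", 0.01, 6.0, 1.0),
    ("video/c", 0.20, 6.0, 2.0),
]
selected = pick_refreshes(pages, budget=2.0)
```

A greedy ranking of this kind ignores interactions between pages, which is one reason an exact formulation such as MIQP may be preferred when the objective is quadratic or the constraints couple pages together.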
Authors
Liu Fangyun (刘芳云); Zhang Zhiyong (张志勇); Li Yuxiang (李玉祥)
College of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China
Source
Computer Measurement & Control (《计算机测量与控制》)
2018, No. 10, pp. 269-275, 308 (8 pages)
Funding
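National Natural Science Foundation of China (61772174, 61370220)
Henan Province Science and Technology Innovation Outstanding Talent Program (174200510011)
Henan Province University Science and Technology Innovation Team Support Program (15IRTSTHN010)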