Abstract
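Building on a study of PageRank, a classic Web structure mining algorithm, and MapReduce, a key cloud computing technology, this paper combines the PageRank algorithm with the MapReduce programming model. Parallel PageRank on large datasets faces two problems: every iteration accesses HDFS, which increases I/O cost, and every iteration spends too much time in the shuffle and sort phases. Two improved algorithms are proposed to address them. One is an improved parallel PageRank based on matrix partitioning; the other is an improved parallel PageRank that reduces the number of HDFS accesses. Finally, a cloud environment is built with Hadoop, and the effect of different BlockSize settings on computing performance is analyzed in the experimental environment. The original and improved algorithms are also tested in the cloud environment on different Web datasets. The results show that the improved algorithms have certain advantages in the space occupied by the result set and in total iteration time, respectively.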
The PageRank algorithm and the MapReduce programming model are combined on the basis of a study of both. To address the problems that parallel PageRank faces when running on large datasets, two improvements are put forward. First, the idea of matrix partitioning is applied to reduce the time spent in the shuffle and sort phases of each PageRank iteration. Second, an algorithm that reduces the number of HDFS accesses is proposed. Finally, the performance of the three algorithms is tested and compared on different Web datasets. The results show that the improved algorithms have advantages in space usage and iteration time.
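For readers unfamiliar with the baseline, the following is a minimal in-memory sketch of how one PageRank iteration maps onto map, shuffle, and reduce steps. It is not the authors' implementation and does not reproduce either proposed improvement: the damping factor of 0.85, the function names, the toy graph, and the choice to skip dangling pages are all assumptions made for illustration, and plain Python dictionaries stand in for Hadoop and HDFS.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor; the abstract does not state one


def map_phase(graph, ranks):
    """Emit (target, contribution) pairs: each page splits its rank over its out-links."""
    for page, out_links in graph.items():
        if not out_links:
            continue  # dangling pages are ignored in this simplified sketch
        share = ranks[page] / len(out_links)
        for target in out_links:
            yield target, share


def shuffle(pairs):
    """Group emitted values by key, standing in for the shuffle and sort phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(graph, grouped):
    """Sum the contributions per page and apply the damping formula."""
    n = len(graph)
    return {
        page: (1 - DAMPING) / n + DAMPING * sum(grouped.get(page, []))
        for page in graph
    }


def pagerank(graph, iterations=20):
    """Run a fixed number of map/shuffle/reduce rounds from a uniform start."""
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        ranks = reduce_phase(graph, shuffle(map_phase(graph, ranks)))
    return ranks


# Hypothetical three-page graph: page -> list of pages it links to.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_graph)
```

In a real Hadoop job, each pass through the loop in `pagerank` would be a separate job that reads its input from HDFS and writes its output back, and the grouping done by `shuffle` would be the framework's shuffle and sort stage. These are the two costs the abstract identifies as targets for improvement.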
Source
《计算机时代》
2012, No. 10, pp. 30-33, 37 (5 pages in total)
Computer Era