
PageRank parallel algorithm based on Web link classification (cited by: 6)
Abstract: To address the low efficiency of the serial PageRank algorithm on massive Web data, a parallel PageRank algorithm based on Web link classification is proposed. First, Web pages are classified by the website they belong to, and pages from different sites are assigned different weights. Second, page ranks are computed in parallel on the Hadoop framework, exploiting the divide-and-conquer nature of MapReduce. Finally, the parallel algorithm is optimized with a data compression method organized in three layers: a data layer, a preprocessing layer, and a computation layer. Experimental results show that, compared with the serial PageRank algorithm, the proposed algorithm in the best case improves result accuracy by 12% and computational efficiency by 33%.
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journal list, 2015, No. 1, pp. 48-52 (5 pages)
Keywords: link classification; Hadoop; PageRank; MapReduce; data compression
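
The first two steps of the abstract (classifying pages by site, then computing ranks with a map step and a reduce step) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weighting scheme, so the two link weights, the damping factor of 0.85, and all function names here are assumptions chosen for illustration, and the MapReduce shuffle is simulated in-process rather than run on Hadoop. The sketch also assumes every page has at least one out-link and that every link target is itself a key of the graph.

```python
from collections import defaultdict
from urllib.parse import urlparse

DAMPING = 0.85            # common PageRank damping factor (assumed, not from the paper)
INTRA_SITE_WEIGHT = 0.5   # hypothetical weight for a link that stays inside one site
INTER_SITE_WEIGHT = 1.0   # hypothetical weight for a link that crosses sites


def site_of(url):
    """Classify a page by the host name of its URL."""
    return urlparse(url).netloc


def link_weight(src, dst):
    """Weight a link by whether source and target belong to the same site."""
    if site_of(src) == site_of(dst):
        return INTRA_SITE_WEIGHT
    return INTER_SITE_WEIGHT


def map_phase(page, rank, out_links):
    """Map: split a page's rank among its out-links in proportion to link weight."""
    total = sum(link_weight(page, dst) for dst in out_links)
    for dst in out_links:
        yield dst, rank * link_weight(page, dst) / total


def reduce_phase(contributions, n_pages):
    """Reduce: combine all contributions received by one page."""
    return (1 - DAMPING) / n_pages + DAMPING * sum(contributions)


def pagerank(graph, iterations=30):
    """Run map, shuffle, and reduce in-process for a fixed number of iterations."""
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        shuffled = defaultdict(list)  # stands in for the MapReduce shuffle
        for page, out_links in graph.items():
            for dst, share in map_phase(page, ranks[page], out_links):
                shuffled[dst].append(share)
        ranks = {page: reduce_phase(shuffled[page], n) for page in graph}
    return ranks


# Toy graph: two pages on a.example and one page on b.example.
graph = {
    "http://a.example/1": ["http://a.example/2", "http://b.example/1"],
    "http://a.example/2": ["http://a.example/1"],
    "http://b.example/1": ["http://a.example/1"],
}
ranks = pagerank(graph)
```

With these assumed weights, the cross-site link from `http://a.example/1` carries twice the share of its same-site link, so `http://b.example/1` ends up ranked above `http://a.example/2`. In a real Hadoop job the map and reduce functions would run as separate tasks over partitioned input, with one job per iteration.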


