期刊文献+

基于Spark的PFP-Growth并行算法优化实现 被引量:6

Optimization of parallel FP-Growth algorithm based on Spark
下载PDF
导出
摘要 随着数据量的增大,FP-Growth算法压缩数据思想的优势就体现出来,基于MapReduce框架的PFP-Growth算法实现该算法在Hadoop平台上的并行化,但是MapReduce框架每次对作业进行操作都要将中间结果输出存储到磁盘,影响算法的效率。为了提高关联挖掘的效率,基于Spark平台,运用均衡分组的思想对该算法进行改进,同时在对具有很长前缀情况进行共享前缀的拆分,通过4个步骤使IPFP-Growth算法在Spark上实现。实验结果表明在Spark平台上优化过后的算法在性能上要优于PFP-Growth算法。 The advantage of the FP-Growth algorithm for compressing data is reflected with the increasing of the data size.With the MapReduce framework,the PFP-Growth algorithm can be parallelized on the Hadoop platform. However,when processing tasks with the MapReduce framework,the intermediate results need to be written to the disk,which will affect the efficiency of the algorithm. Therefore,based on Spark platform,this algorithm was improved according to the concept of balanced grouping to improve the efficiency of association mining. In addition,if there is a long prefix,the improved algorithm will split the shared prefix. The IPFP-Growth is implemented in Spark through four steps. The experimental results show that the performance of the algorithm optimized in Spark is superior to that of the PFP-Growth algorithm.
作者 方向 张功萱
出处 《现代电子技术》 北大核心 2016年第8期9-13,共5页 Modern Electronics Technique
基金 江苏省973项目(BK2011022) 国家自然科学基金重点项目(612724420)
关键词 并行化 SPARK 关联挖掘 PFP-Growth parallelization Spark association mining PFP-Growth
  • 相关文献

参考文献7

  • 1HAN J, PEI J, YIN Y. Mining frequent patterns without candi- date generation [J]. ACM SIGMOD record, 2000, 29(2): 1-12.
  • 2LI H, WANG Y, ZHANG D, et al. PFP: parallel FP-Growth for query recommendation [C]// Proceedings of the 2008 ACM Con- ference on Recommender Systems. rS.I.I: ACM. 2008; 107-114+.
  • 3王智钢,王池社,马青霞.分布式并行关联规则挖掘算法研究[J].计算机应用与软件,2013,30(10):113-115. 被引量:13
  • 4曾志勇,杨呈智,陶冶.负载均衡的FP-growth并行算法研究[J].计算机工程与应用,2010,46(4):125-126. 被引量:10
  • 5吕雪骥,李龙澍.FP-Growth算法MapReduce化研究[J].计算机技术与发展,2012,22(11):123-126. 被引量:18
  • 6ZHOU L, ZHONG Z, CHANG J, et al. Balanced parallel FP- Growth with MapReduce [C]// Proceedings of 2011 IEEE Youth Conference on Information Computing and Telecommunica- tions. [S.1.]: IEEE, 2010: 243-248.
  • 7QIU H, GU R, YUAN C, et al. YAFIM: A parallel frequent itemset mining algorithm with Spark [C]// 2014 IEEE Interna- tional Parallel & Distributed Processing Symposium Work- shops. [S.1.]: IEEE, 2014: 1664-1671.

二级参考文献28

  • 1谈克林,孙志挥.一种FP树的并行挖掘算法[J].计算机工程与应用,2006,42(13):155-157. 被引量:10
  • 2Agrawal R,Imielinski T,Swami A.Mining association rules between sets of items in large databases[C]//Proceedings of the ACM SIG- MOD International Conference Management of Date,Washington,1993:207-216.
  • 3Ha n J,Kamber M.Data mining:Concepts and techniques[M].Beijing: High Education Press, 2001.
  • 4Agrawal R,Imielinski T,Swami A.Mining association rules between sets of items in large databases[C]//Proc 1993 ACM-SIGMOD Int Conf Management of Data, Washington, DC, May 1993 : 207-216.
  • 5Savasere A,Omiecinski E,Navathe S.An efficient algorithm for mining association rules in large databases[C]//Proc of the 21st VLDB Conference, Zurich, Switzerland, 1995 : 432-443.
  • 6Agrawal R C,Agarwal C,Prasad V V V.A tree projection algorithm for generation of frequent itemscts[J].Journal of Parallel and Distributed Computing:Special Issue on High Performance Data Mining, 2000.
  • 7Han J W,Pei J,Yin Y W,et al.Mining frequent patterns without candidate generation[C]//Proc ACM-SIGMOD Int Conf on Management of Data(SIGMOD'00),DalIas,TX,2000:1-12.
  • 8Han J,Pei J,Yin Y,et al.Mining frequent patterns without candidate generation : A frequent-pattern tree approach[J].Data Mining and Knowledge Discovery, 2004,8 : 53-87.
  • 9Han Jiawei, Pei Jian, ytin Yiwen. Mining frequent patterns without candidate generation [ C ]//SIGMOD' 00. [ s. 1.] :[ s. n. ] ,2000.
  • 10Agrawal R, Shafer J C. Parallel mining of association rules[J].1EEE Transactions on knowledge and date engineering, 1996,8(6) :962-969.

共引文献34

同被引文献42

引证文献6

二级引证文献55

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部