Frequent Itemset Mining of High-Dimensional Data Based on MapReduce

Original title: 基于MapReduce的高维数据频繁项集挖掘
Cited by: 8
Abstract: When mining large-scale, high-dimensional data, traditional data mining algorithms suffer from low accuracy in capturing data features, unbalanced node load, frequent data interaction, and low compactness of the resulting frequent itemsets. This paper proposes PARDG-MR, a parallel mining algorithm based on MapReduce. Drawing on the characteristics of high-dimensional data, it designs the DGPL strategy, which is built on a dimensional granulation algorithm and a load balancing algorithm, and preprocesses the data to address two problems: the difficulty of capturing feature attributes in complex high-dimensional data, and the unbalanced node load that arises during data partitioning. It then constructs PARM, a parallel frequent itemset mining strategy based on the PJPFP-Tree, which parallelizes the grouping of frequent itemsets and thereby improves the running efficiency of data processing. On this basis, it proposes PJPFP, an integrated node pruning algorithm based on a pruning prefix inference, which improves pruning efficiency during frequent itemset mining and increases the compactness of the frequent itemsets. Experimental results on three data sets (Webdocs, NDC, and Gisette) show that, compared with the PFP-growth, PWARM, and MRPrePost algorithms, PARDG-MR shortens running time by approximately 20% on average, improving mining efficiency while reducing memory usage.
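For readers unfamiliar with the setting, the following is a minimal sketch of MapReduce-style support counting, the step that finds frequent single items before any tree-based mining begins. It is not the PARDG-MR algorithm, and it does not implement DGPL, PARM, or PJPFP, whose details the abstract does not give. The function names `map_phase`, `reduce_phase`, and `frequent_items`, the toy transactions, and the `min_support` threshold are all assumptions made for illustration, and the map and reduce steps are simulated in a single process rather than on a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(transaction):
    """Emit (item, 1) once per distinct item in a transaction (hypothetical helper)."""
    return [(item, 1) for item in set(transaction)]


def reduce_phase(pairs):
    """Sum the emitted counts per item, as a reducer would after the shuffle."""
    counts = defaultdict(int)
    for item, n in pairs:
        counts[item] += n
    return dict(counts)


def frequent_items(transactions, min_support):
    """Keep only the items whose support count reaches min_support."""
    pairs = chain.from_iterable(map_phase(t) for t in transactions)
    counts = reduce_phase(pairs)
    return {item: c for item, c in counts.items() if c >= min_support}


# Toy data, assumed for illustration only.
transactions = [
    ["a", "b", "c"],
    ["a", "c"],
    ["a", "d"],
    ["b", "c"],
]
result = frequent_items(transactions, min_support=2)
print(result)
```

On a real cluster each mapper would process one partition of the transactions, which is where the load balancing concern raised in the abstract comes in: uneven partitions leave some nodes idle while others are still counting.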
Authors: ZHAO Xincan (赵欣灿); ZHU Yun (朱云); MAO Yimin (毛伊敏)
Affiliations: School of Science, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China; School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China
Source: Computer Engineering (《计算机工程》), 2022, No. 3, pp. 81-89 (9 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list (北大核心).
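Funding: National Key Research and Development Program of China (2018YFC1504705); National Natural Science Foundation of China (41562019); Science and Technology Project of the Jiangxi Provincial Department of Education (GJJ151528, GJJ151531).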
Keywords: high-dimensional data; frequent itemset; dimensional granulation; parallelization; candidate pruning strategy