
Sp-IEclat: A Big Data Parallel Association Rule Mining Algorithm (cited by: 18)
Abstract: To address the low efficiency of association rule mining in big data environments, this paper builds on the Eclat algorithm, which uses a vertical database to turn the merging of transactions into set operations. It proposes Sp-IEclat (Improved Eclat algorithm on Spark Framework), a parallel association rule mining algorithm for big data. The algorithm runs on Spark's in-memory computing framework, which reduces disk input/output and lowers the I/O load; it uses bitmap operations to cut the time cost of intersections and reduce CPU usage; and it applies a prefix-partitioning pruning technique to shrink the amount of data involved in intersection operations, shortening computation time. Experiments on the mushroom and webdocs datasets were run on two big data platforms. The results show that Sp-IEclat is more time-efficient than the Eclat algorithm on the MapReduce framework and than both the FP-Growth and Eclat algorithms on the Spark framework. Cluster performance monitoring further shows that, compared with FP-Growth and Eclat on Spark, Sp-IEclat has lower CPU usage and lower cluster I/O load.
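The three ideas named in the abstract (a vertical database, bitmap intersection, and prefix-based partitioning) can be illustrated with a minimal single-machine sketch. This is not the authors' Sp-IEclat implementation and it does not use Spark. The function names `vertical_bitmaps`, `support`, and `eclat` are hypothetical, and Python integers stand in for bitmaps. Each top-level prefix class is processed independently of the others, which is the property a distributed version could presumably exploit.

```python
from collections import defaultdict


def vertical_bitmaps(transactions):
    """Build a vertical database: item -> bitmap, where bit i is set
    when transaction i contains the item (assumed representation)."""
    bitmaps = defaultdict(int)
    for tid, items in enumerate(transactions):
        for item in set(items):
            bitmaps[item] |= 1 << tid
    return dict(bitmaps)


def support(bitmap):
    """Support count = number of set bits in the bitmap."""
    return bin(bitmap).count("1")


def eclat(transactions, min_support):
    """Return {itemset tuple: support} for all frequent itemsets."""
    bitmaps = vertical_bitmaps(transactions)
    frequent = {}

    # Frequent 1-itemsets in a fixed item order, so each itemset is
    # generated exactly once.
    first_level = sorted(
        (((item,), bm) for item, bm in bitmaps.items()
         if support(bm) >= min_support),
        key=lambda pair: pair[0],
    )

    def extend(prefix_class):
        # All itemsets in prefix_class share every item except the last.
        for i, (itemset_a, bm_a) in enumerate(prefix_class):
            frequent[itemset_a] = support(bm_a)
            next_class = []
            for itemset_b, bm_b in prefix_class[i + 1:]:
                joined = bm_a & bm_b  # bitmap AND replaces tid-set intersection
                if support(joined) >= min_support:
                    next_class.append((itemset_a + (itemset_b[-1],), joined))
            if next_class:
                # Only frequent extensions are carried forward (pruning).
                extend(next_class)

    extend(first_level)
    return frequent


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
    for itemset, count in eclat(data, min_support=3).items():
        print(itemset, count)
```

In this toy data set every single item and every pair is frequent at a minimum support of 3, while the triple `("a", "b", "c")` appears in only two transactions and is pruned.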
Authors: LI Cheng-yan (李成严); XIN Xue (辛雪); ZHAO Shuai (赵帅); FENG Shi-xiang (冯世祥). School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China
Source: Journal of Harbin University of Science and Technology (《哈尔滨理工大学学报》), indexed in CAS and the Peking University Core Journals list (北大核心), 2021, No. 4, pp. 109-118 (10 pages)
Funding: Science and Technology Research Project of the Heilongjiang Provincial Department of Education (12541142).
Keywords: big data; association rule mining; frequent itemset; Spark resilient distributed dataset (RDD); MapReduce framework