Abstract
Association rule mining has long been an important task in data mining. With the advent of the big data era, however, data volumes are growing exponentially, and traditional serial mining algorithms run short of memory and computing resources. To address this problem, this paper proposes IMREclat, an improved Eclat algorithm based on the MapReduce parallel programming model. IMREclat uses two MapReduce jobs and proceeds in three phases. First, the transaction database is partitioned evenly and the frequent 2-itemsets are mined in parallel. Second, the frequent 2-itemsets are converted to a vertical data format, with transaction lists stored in binary form, and are grouped by equivalence class and the weight of each class. Finally, the grouped data serve as input, and all frequent itemsets are mined in parallel by an Eclat algorithm improved with a pre-pruning property. Experiments show that IMREclat outperforms the existing MREclat algorithm in running time and scales well.
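To make the vertical data format and binary transaction lists concrete, here is a minimal serial sketch of classic Eclat in Python. It is not the IMREclat implementation: the MapReduce jobs, the weight-based grouping of equivalence classes, and the pre-pruning property are all omitted, and the function names (`to_vertical`, `support`, `eclat`, `mine`) are hypothetical. Each item's transaction list is assumed to be stored as an integer bitmask, so intersecting two lists is a single bitwise AND.

```python
from collections import defaultdict


def to_vertical(transactions):
    """Convert horizontal transactions to a vertical layout.

    Each item maps to an int bitmask; bit i is set when transaction i
    contains the item.
    """
    tidsets = defaultdict(int)
    for tid, items in enumerate(transactions):
        for item in set(items):
            tidsets[item] |= 1 << tid
    return dict(tidsets)


def support(mask):
    """Number of transactions represented by a bitmask."""
    return bin(mask).count("1")


def eclat(prefix, candidates, min_support, out):
    """Depth-first search over one equivalence class.

    `candidates` is a list of (item, mask) pairs that all extend `prefix`;
    every pair is already known to be frequent.
    """
    for i, (item, mask) in enumerate(candidates):
        itemset = prefix + (item,)
        out[itemset] = support(mask)
        extensions = []
        for other, other_mask in candidates[i + 1:]:
            joined = mask & other_mask  # intersect the two transaction lists
            if support(joined) >= min_support:
                extensions.append((other, joined))
        if extensions:
            eclat(itemset, extensions, min_support, out)


def mine(transactions, min_support):
    """Return a dict mapping each frequent itemset (a tuple) to its support."""
    vertical = to_vertical(transactions)
    frequent = sorted(
        (item, mask)
        for item, mask in vertical.items()
        if support(mask) >= min_support
    )
    out = {}
    eclat((), frequent, min_support, out)
    return out
```

Calling `mine(transactions, min_support)` on a list of item lists returns every frequent itemset with its support count. In a parallel setting, each top-level equivalence class (all itemsets sharing a prefix) could be handed to a separate worker, which is the kind of partitioning the abstract describes.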
Authors
向春梅
陈超
XIANG Chunmei; CHEN Chao (College of Communication Engineering, Chengdu University of Information Technology, Chengdu 610255, China)
Source
《成都信息工程大学学报》
2019, No. 4, pp. 369-374 (6 pages)
Journal of Chengdu University of Information Technology
Funding
Sichuan Science and Technology Program (18ZDYF3278)