Abstract
Generating the frequent 2-itemset L2 from the candidate set C2 is a bottleneck of the Apriori algorithm for mining association rules. The Direct Hashing and Pruning (DHP) algorithm uses a generated hash table H2 to delete useless candidate itemsets from C2, thereby improving the efficiency of generating L2. However, the traditional DHP is a serial algorithm and cannot effectively handle large-scale data. To solve this problem, a parallelized DHP algorithm, termed H_DHP, was proposed. First, the feasibility of the parallelization strategy for DHP was analyzed and proved theoretically. Then, the generation of the hash table H2 and of the frequent itemsets L1 and L3~Lk was implemented in parallel on the Hadoop platform, and the association rules were generated with the HBase database. Simulation results show that, compared with the traditional DHP algorithm, H_DHP performs better in data processing efficiency, the scale of data sets it can handle, speedup, and scalability.
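The hash-based pruning step the abstract refers to can be sketched as follows. This is a minimal single-machine illustration of the DHP idea, not the paper's Hadoop implementation; the function name, bucket count, and data layout are illustrative assumptions:

```python
from itertools import combinations

def dhp_prune_c2(transactions, min_support, num_buckets=101):
    """Sketch of DHP's hash-based pruning: while counting 1-item
    support, hash every 2-item pair of each transaction into a
    bucket table H2. A candidate 2-itemset can only be frequent
    if its bucket's count reaches min_support, so buckets below
    the threshold let us discard candidates before scanning again.
    """
    item_count = {}
    bucket_count = [0] * num_buckets  # the hash table H2

    for t in transactions:
        for item in t:
            item_count[item] = item_count.get(item, 0) + 1
        # hash all 2-item pairs of the transaction into H2
        for pair in combinations(sorted(t), 2):
            bucket_count[hash(pair) % num_buckets] += 1

    # frequent 1-itemsets L1
    l1 = {i for i, c in item_count.items() if c >= min_support}

    # candidate 2-itemsets C2, pruned by the bucket filter:
    # keep a pair only if its bucket count could support it
    c2 = [p for p in combinations(sorted(l1), 2)
          if bucket_count[hash(p) % num_buckets] >= min_support]
    return l1, c2
```

The bucket test can keep a false positive when unrelated pairs collide in one bucket, but it never drops a truly frequent pair, which is why DHP still needs one counting pass over the pruned C2. H_DHP distributes exactly these counting steps across Hadoop map tasks.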
Source
Journal of Computer Applications (《计算机应用》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2016, No. 12, pp. 3280-3284, 3291 (6 pages)
Funding
National Key Technology R&D Program of China (2014BAH11F01, 2014BAH11F02)
Sichuan Science and Technology Support Program (15GZ0079)