Abstract
Existing association rule mining algorithms rely on simple storage structures, generate large numbers of redundant candidate sets, have high time and space complexity, and mine inefficiently. To further speed up frequent itemset mining and improve execution performance, this paper proposes an association rule mining algorithm based on an improved in-memory structure. Built on the Spark distributed framework, the algorithm mines frequent itemsets in parallel across partitions. During mining it stores items in a Bloom filter and prunes both the transaction set and the candidate set, which speeds up frequent itemset mining and saves computing resources. Compared with the YAFIM and MR-Apriori algorithms, the proposed algorithm mines frequent itemsets markedly more efficiently while occupying less memory. It improves mining speed, reduces memory pressure, and scales well, so it can be applied to larger datasets and clusters.
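As an illustration of the ideas named in the abstract, the following is a minimal single-machine sketch, not the authors' implementation. It assumes a standard Apriori-style flow: frequent (k-1)-itemsets are stored in a Bloom filter, a k-candidate is kept only if every (k-1)-subset is probably in the filter, and transactions are trimmed to frequent items. The class BloomFilter and the helpers itemset_key, prune_candidates, and trim_transactions are hypothetical names introduced here; the Spark partitioning described in the paper is omitted.

```python
import hashlib
import math
from itertools import combinations


class BloomFilter:
    """Fixed-size bit array with several hash probes per key (illustrative only)."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        n = max(1, expected_items)
        self.num_bits = max(
            8, math.ceil(-n * math.log(false_positive_rate) / (math.log(2) ** 2))
        )
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all probe positions from one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def itemset_key(itemset):
    """Canonical string key for an itemset (assumed encoding: sorted, comma-joined)."""
    return ",".join(sorted(itemset))


def prune_candidates(candidates, frequent_prev):
    """Keep k-candidates whose (k-1)-subsets all appear (probably) in the filter."""
    kept = []
    for cand in candidates:
        subsets = combinations(sorted(cand), len(cand) - 1)
        if all(itemset_key(s) in frequent_prev for s in subsets):
            kept.append(cand)
    return kept


def trim_transactions(transactions, frequent_items, k):
    """Drop infrequent items, then drop transactions too short to hold a k-itemset."""
    trimmed = []
    for t in transactions:
        reduced = frozenset(t) & frequent_items
        if len(reduced) >= k:
            trimmed.append(reduced)
    return trimmed
```

Because a Bloom filter can report false positives, a candidate that survives this pruning still needs an exact support count against the trimmed transactions; the filter only reduces how many candidates reach that step.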
Authors
Wang Yonggui (王永贵)
Xie Nan (谢南)
Qu Haicheng (曲海成)
School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
Source
Application Research of Computers (《计算机应用研究》)
CSCD
Peking University Core Journals (北大核心)
2020, No. 1, pp. 167-171 (5 pages)
Funding
National Natural Science Foundation of China (61404069)
National Natural Science Foundation of China, Young Scientists Fund (41701479)