Abstract
FP-Growth is expensive in both memory and computation time when the database is very large. This paper proposes a parallel algorithm designed to run on a PC cluster. The algorithm uses a projection method to find the conditional database (conditional pattern base) of each frequent item directly, splits the mining of these conditional databases into a number of independent sub-tasks, and distributes the sub-tasks to cluster nodes for parallel execution; a central node aggregates the sub-results and outputs the final result. Experiments show that the algorithm not only speeds up computation and avoids the memory overflow that occurs when the database is too large, but also scales better than the original FP-Growth algorithm.
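The project-split-aggregate idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: a thread pool stands in for the cluster nodes, each sub-task mines its conditional database by recursive projection rather than by building an FP-tree, and all function names (`frequent_items`, `project`, `mine`, `parallel_mine`) are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def frequent_items(db, min_support):
    """Return frequent items ordered by descending support, plus all counts."""
    counts = Counter(item for t in db for item in set(t))
    items = [i for i, c in counts.items() if c >= min_support]
    return sorted(items, key=lambda i: (-counts[i], i)), counts


def project(db, order):
    """Build one conditional database per frequent item in a single pass.

    For each transaction, the conditional database of an item receives the
    frequent items that precede it in the global support order.
    """
    rank = {item: r for r, item in enumerate(order)}
    cond = {item: [] for item in order}
    for t in db:
        kept = sorted((i for i in set(t) if i in rank), key=rank.get)
        for pos, item in enumerate(kept):
            if pos:  # skip empty prefixes
                cond[item].append(kept[:pos])
    return cond


def mine(db, suffix, min_support):
    """Mine one conditional database; returns {frozenset(itemset): support}."""
    order, counts = frequent_items(db, min_support)
    cond = project(db, order)
    result = {}
    for item in order:
        pattern = suffix | {item}
        result[frozenset(pattern)] = counts[item]
        result.update(mine(cond[item], pattern, min_support))
    return result


def parallel_mine(db, min_support, workers=4):
    """Central-node role: project, hand out sub-tasks, aggregate sub-results."""
    order, counts = frequent_items(db, min_support)
    cond = project(db, order)
    result = {frozenset([i]): counts[i] for i in order}
    # Assumption: threads stand in for cluster nodes. Each sub-task depends
    # only on its own conditional database, so it could run on any node.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(mine, cond[i], frozenset([i]), min_support)
            for i in order
        ]
        for f in futures:
            result.update(f.result())
    return result


if __name__ == "__main__":
    transactions = [
        ["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"],
    ]
    for itemset, support in sorted(
        parallel_mine(transactions, 3).items(),
        key=lambda kv: (len(kv[0]), sorted(kv[0])),
    ):
        print(sorted(itemset), support)
```

Because each conditional database is self-contained, no sub-task needs the full database or a global tree in memory, which is the property the abstract credits for avoiding memory overflow.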
Source
《计算机工程》
CAS
CSCD
PKU Core Journal (北大核心)
2009, No. 20, pp. 71-72, 75 (3 pages)
Computer Engineering
Funding
National Natural Science Foundation of China project "Research on Clustering of High-Dimensional Sparse Data" (70771007)