Abstract
FP-Growth is a classic algorithm for frequent pattern mining. It generates all frequent patterns without producing candidate itemsets, and it is considerably more efficient than earlier algorithms such as Apriori. However, while mining, FP-Growth must recursively build a large number of conditional pattern bases and conditional FP-trees and mine each of them in turn. The resulting time and memory cost limits its practical use on large-scale databases, and further performance gains can be expected from parallel execution. A computer cluster is a parallel processing system formed by connecting multiple ordinary computers in a defined way. It is a distributed computing environment that offers strong computing power, good cost-effectiveness, and flexibility. This paper proposes Gridify FP-Growth, a parallel mining algorithm for computer clusters. Built on FP-Growth, it partitions the mining work into tasks, assigns them to the compute nodes of the cluster so that each node's computing resources are fully used, and finally combines the partial results from the nodes into the overall result. Experiments show that Gridify FP-Growth does not sacrifice accuracy, substantially reduces execution time, and effectively relieves memory pressure when processing large-scale databases.
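The partition-and-merge idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's Gridify FP-Growth implementation: it assumes one task per frequent item, uses plain projected databases (conditional pattern bases) in place of an explicit FP-tree, and uses local threads as a stand-in for cluster nodes. All function names (`frequent_items`, `rank`, `conditional_base`, `mine_task`, `mine`, `parallel_mine`) are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def frequent_items(transactions, min_support):
    """Count each item once per transaction and keep the frequent ones."""
    counts = Counter(item for t in transactions for item in set(t))
    return {i: c for i, c in counts.items() if c >= min_support}


def rank(freq):
    """Fixed item order: descending support, ties broken by item name."""
    ordered = sorted(freq, key=lambda i: (-freq[i], i))
    return {item: r for r, item in enumerate(ordered)}


def conditional_base(transactions, order, item):
    """For every transaction containing `item`, keep the frequent items
    that precede it in `order` (its conditional pattern base)."""
    base = []
    for t in transactions:
        ranked = sorted((i for i in set(t) if i in order), key=order.get)
        if item in ranked:
            base.append(ranked[:ranked.index(item)])
    return base


def mine_task(transactions, order, item, count, min_support, suffix=()):
    """One independent unit of work: all patterns that end with `item`."""
    grown = (item,) + suffix
    result = {tuple(sorted(grown)): count}
    base = conditional_base(transactions, order, item)
    result.update(mine(base, min_support, grown))
    return result


def mine(transactions, min_support, suffix=()):
    """Serial pattern growth; returns {sorted itemset tuple: support}."""
    freq = frequent_items(transactions, min_support)
    order = rank(freq)
    patterns = {}
    for item, count in freq.items():
        patterns.update(
            mine_task(transactions, order, item, count, min_support, suffix))
    return patterns


def parallel_mine(transactions, min_support, workers=4):
    """Assumed partitioning: one task per frequent item, merged at the end.
    Threads stand in for cluster nodes in this sketch."""
    freq = frequent_items(transactions, min_support)
    order = rank(freq)
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(mine_task, transactions, order, item, count, min_support)
            for item, count in freq.items()
        ]
        for future in futures:
            # Tasks yield disjoint pattern sets, so merging is a plain union.
            merged.update(future.result())
    return merged
```

Because every itemset is produced only by the task for its lowest-ranked item, the partial results never overlap, which is why the merge step does not need to reconcile counts. On a real cluster each task would also need the transactions it depends on shipped to its node, a cost this sketch does not model.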
Source
《中国管理信息化》
2009, Issue 15, pp. 36-38 (3 pages)
China Management Informationization