摘要
频繁闭项集惟一确定频繁项集且规模小得多。CROP是一种基于复合频繁模式树的、频繁闭项集高效挖掘算法,但存在着候选结点过多的问题。这些非闭合结点的生成、检查和剪裁带来了大量不必要的操作。提出了一种改进的频繁闭项集挖掘算法CROP_Index。该算法用"索引数组"来组织数据,找到频繁共同出现的项集。基于二进制位图,给出了一个包含索引的计算方法,并利用索引启发信息合并,得到复合型频繁模式树的初始结点;同时给出一些新的性质,使得改进的算法只生成闭合结点,从而节省了大量不必要的操作,缩小了搜索空间。实验结果表明该算法效率较高。
The set of frequent closed itemsets determines exactly the complete set of all frequent itemsets and is usually much smaller than the latter. Based on compound frequent itemset tree (CFIST), CROP is an efficient algorithm for mining frequent closed itemset. However, there are a lot of redundant candidate nodes of CFIST. Correspondingly, the operations, composed of the generation, closure checking, and pruning of those non-closed nodes, lead to high computational cost. In this paper, CROP Index, which is an improved algorithm for mining frequent closed itemset, is proposed. Firstly, the "index array" is proposed, which is used for discovering those itemsets that always appear together. Then, based on bitmap, an algorithm for computing index array is presented. Furthermore, frequent items are merged to initial nodes of CFIST according to heuristic information provided by index array. Finally, some new properties, which can avoid the generation of redundant nodes, are proposed. Thus the improved algorithm only generates closed nodes. Correspondingly, the unnecessary operations are avoided, and the search space is reduced to certain extents. The experimental results show that the proposed algorithm is efficient.
出处
《计算机科学》
CSCD
北大核心
2007年第8期165-167,189,共4页
Computer Science
基金
国家科技成果重点推广项目计划(2003EC000001)资助
关键词
数据挖掘
关联规则
频繁闭项集
索引数组
复合频繁模式树
Data mining, Association rule, Frequent closed itemset, Index array, Compound frequent itemset tree