This paper presents a new efficient algorithm for mining frequent closed itemsets. It enumerates the closed set of frequent itemsets by using a novel compound frequent itemset tree that facilitates fast growth and eff...This paper presents a new efficient algorithm for mining frequent closed itemsets. It enumerates the closed set of frequent itemsets by using a novel compound frequent itemset tree that facilitates fast growth and efficient pruning of search space. It also employs a hybrid approach that adapts search strategies, representations of projected transaction subsets, and projecting methods to the characteristics of the dataset. Efficient local pruning, global subsumption checking, and fast hashing methods are detailed in this paper. The principle that balances the overheads of search space growth and pruning is also discussed. Extensive experimental evaluations on real world and artificial datasets showed that our algorithm outperforms CHARM by a factor of five and is one to three orders of magnitude more efficient than CLOSET and MAFIA.展开更多
Previous research works have presented convincing arguments that a frequent pattern mining algorithm should not mine all frequent but only the closed ones because the latter leads to not only more compact yet complete...Previous research works have presented convincing arguments that a frequent pattern mining algorithm should not mine all frequent but only the closed ones because the latter leads to not only more compact yet complete result set but also better efficiency. Upon discovery of frequent closed XML query patterns, indexing and caching can be effectively adopted for query performance enhancement. Most of the previous algorithms for finding frequent patterns basically introduced a straightforward generate-and-test strategy. In this paper, we present SOLARIA*, an efficient algorithm for mining frequent closed XML query patterns without candidate maintenance and costly tree-containment checking. Efficient algorithm of sequence mining is involved in discovering frequent tree-structured patterns, which aims at replacing expensive containment testing with cheap parent-child checking in sequences. SOLARIA* deeply prunes unrelated search space for frequent pattern enumeration by parent-child relationship constraint. By a thorough experimental study on various real-life data, we demonstrate the efficiency and scalability of SOLARIA* over the previous known alternative. SOLARIA* is also linearly scalable in terms of XML queries' size.展开更多
数据流是随着时间顺序快速变化的和连续的,对其进行频繁模式挖掘时会出现概念漂移现象.在一些数据流应用中,通常认为最新的数据具有最大的价值.数据流挖掘会产生大量无用的模式,为了减少无用模式且保证无损压缩,需要挖掘闭合模式.因此,...数据流是随着时间顺序快速变化的和连续的,对其进行频繁模式挖掘时会出现概念漂移现象.在一些数据流应用中,通常认为最新的数据具有最大的价值.数据流挖掘会产生大量无用的模式,为了减少无用模式且保证无损压缩,需要挖掘闭合模式.因此,提出了一种基于时间衰减模型和闭合算子的数据流闭合模式挖掘方式TDMCS(Time-Decay-Model-based Closed frequent pattern mining on data Stream).该算法采用时间衰减模型来区分滑动窗口内的历史和新近事务权重,使用闭合算子提高闭合模式挖掘的效率,设计使用最小支持度-最大误差率-衰减因子的三层架构避免概念漂移,设计一种均值衰减因子平衡算法的高查全率和高查准率.实验分析表明该算法适用于挖掘高密度、长模式的数据流;且具有较高的效率,在不同大小的滑动窗口条件下性能表现是稳态的,同时也优于其他同类算法.展开更多
文摘This paper presents a new efficient algorithm for mining frequent closed itemsets. It enumerates the closed set of frequent itemsets by using a novel compound frequent itemset tree that facilitates fast growth and efficient pruning of search space. It also employs a hybrid approach that adapts search strategies, representations of projected transaction subsets, and projecting methods to the characteristics of the dataset. Efficient local pruning, global subsumption checking, and fast hashing methods are detailed in this paper. The principle that balances the overheads of search space growth and pruning is also discussed. Extensive experimental evaluations on real world and artificial datasets showed that our algorithm outperforms CHARM by a factor of five and is one to three orders of magnitude more efficient than CLOSET and MAFIA.
基金Acknowledgements: This work is supported by the National Natural Science Foundation of China (No. 60205007), Natural Science Foundation of Guangdong Province (No.031558, No. 04300462), Research Foundation of National Science and Technology Plan Project (No.2004BA721A02), Research Foundation of Science and Technology Plan Project in Guangdong Province (No.2003C50118), and Research Foundation of Science and Technology Plan Project in Guangzhou City (No.2002Z3-E0017).
基金This work is supported in part by the National Natural Science Foundation of China under Grant No.60573094the National Grand Fundamental Research 973 Program of China under Grant No.2006CB303103+1 种基金the National High Technology Development 863 Program of China under Grant No.2006AA01A101Tsinghua Basic Research Foundation under Grant No.JCqn2005022.
文摘Previous research works have presented convincing arguments that a frequent pattern mining algorithm should not mine all frequent but only the closed ones because the latter leads to not only more compact yet complete result set but also better efficiency. Upon discovery of frequent closed XML query patterns, indexing and caching can be effectively adopted for query performance enhancement. Most of the previous algorithms for finding frequent patterns basically introduced a straightforward generate-and-test strategy. In this paper, we present SOLARIA*, an efficient algorithm for mining frequent closed XML query patterns without candidate maintenance and costly tree-containment checking. Efficient algorithm of sequence mining is involved in discovering frequent tree-structured patterns, which aims at replacing expensive containment testing with cheap parent-child checking in sequences. SOLARIA* deeply prunes unrelated search space for frequent pattern enumeration by parent-child relationship constraint. By a thorough experimental study on various real-life data, we demonstrate the efficiency and scalability of SOLARIA* over the previous known alternative. SOLARIA* is also linearly scalable in terms of XML queries' size.
文摘数据流是随着时间顺序快速变化的和连续的,对其进行频繁模式挖掘时会出现概念漂移现象.在一些数据流应用中,通常认为最新的数据具有最大的价值.数据流挖掘会产生大量无用的模式,为了减少无用模式且保证无损压缩,需要挖掘闭合模式.因此,提出了一种基于时间衰减模型和闭合算子的数据流闭合模式挖掘方式TDMCS(Time-Decay-Model-based Closed frequent pattern mining on data Stream).该算法采用时间衰减模型来区分滑动窗口内的历史和新近事务权重,使用闭合算子提高闭合模式挖掘的效率,设计使用最小支持度-最大误差率-衰减因子的三层架构避免概念漂移,设计一种均值衰减因子平衡算法的高查全率和高查准率.实验分析表明该算法适用于挖掘高密度、长模式的数据流;且具有较高的效率,在不同大小的滑动窗口条件下性能表现是稳态的,同时也优于其他同类算法.