Abstract
Frequent itemset mining (FIM) is one of the most important data mining tasks, and the characteristics of the dataset being mined have a significant impact on the performance of FIM algorithms. Sparseness is a typical feature of big data and poses a severe challenge to the performance of traditional FIM algorithms. To address the problem of performing FIM efficiently on sparse data, we start from the characteristics of sparse data and analyze their main effects on the performance of three types of FIM algorithms, survey the FIM algorithms that have been proposed for sparse data, discuss the optimization strategies these algorithms use, and evaluate representative sparse-data FIM algorithms experimentally. The experimental results show that the pattern growth algorithm with a pseudo-construction strategy is the most suitable for FIM on sparse data, with a considerable advantage over the other algorithms in both running time and storage space.
Authors
XIAO Wen (肖文)
HU Juan (胡娟)
Wentian College, Hohai University, Maanshan 243031, China
Source
Computer Engineering & Science (《计算机工程与科学》)
CSCD
PKU Core (北大核心)
2019, No. 5, pp. 780-787 (8 pages)
Funding
Anhui Provincial Support Program for Outstanding Young Talents in Universities (gxyq2018139)
Keywords
big data
sparse data
frequent itemset mining (FIM)
performance analysis
survey