Abstract
A data stream is a continuous sequence of data elements that changes rapidly over time, and the knowledge it contains changes with it. In some data stream applications, the most recent data is considered the most valuable, so a time decay model (TDM) is used to mine frequent patterns over the stream. Existing ways of setting the decay factor either involve an element of randomness, which makes the result set unstable, or target only 100% recall or only 100% precision and neglect the complementary measure (high precision or high recall, respectively). To balance high recall against high precision while keeping the result set stable, a way of setting an average decay factor is designed. To further increase the weight of the latest transactions and reduce the weight of historical ones, a second way of setting the decay factor, based on the Gaussian function, is proposed. To compare the strengths and weaknesses of the different settings, four time decay models are studied and designed, and algorithms based on these four models are used to mine closed frequent patterns over data streams. Experiments on high-density and low-density data streams show that the average decay factor balances high recall and high precision, and that the Gaussian decay factor yields better algorithm performance than the other settings.
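The abstract does not give the formulas, so the following is only a minimal sketch of how a time decay model weights transactions, under stated assumptions. It assumes the commonly used update rule in which the stored count of a pattern is multiplied by a decay factor each time a new transaction arrives; it assumes that the "average" factor is the arithmetic mean of two caller-supplied bounds; and it assumes a Gaussian weight of the form exp(-age^2 / (2 * sigma^2)). The function names `decayed_support`, `average_decay` and `gaussian_weights` and the parameter `sigma` are hypothetical and are not taken from the paper.

```python
from math import exp


def decayed_support(stream, pattern, decay):
    """Decayed count of `pattern` after all transactions in `stream`.

    Assumed update rule: count = count * decay + (1 if pattern occurs else 0),
    so older occurrences are discounted once per later transaction.
    """
    pattern = frozenset(pattern)
    count = 0.0
    for transaction in stream:
        hit = 1.0 if pattern <= set(transaction) else 0.0
        count = count * decay + hit
    return count


def average_decay(lower, upper):
    """Assumption: the average factor is the mean of two bounds,
    e.g. one derived for full recall and one for full precision."""
    return (lower + upper) / 2.0


def gaussian_weights(n, sigma):
    """Assumed Gaussian weighting: the newest of n transactions gets
    weight 1, and weights shrink with the square of the age."""
    return [exp(-((n - 1 - i) ** 2) / (2.0 * sigma ** 2)) for i in range(n)]


if __name__ == "__main__":
    stream = [["a", "b"], ["a"], ["b"], ["a", "b"]]
    print(decayed_support(stream, {"a"}, decay=0.5))  # 1.375
    print(average_decay(0.90, 0.99))
    print(gaussian_weights(4, sigma=1.5))
```

With a decay factor of 0.5, the three occurrences of `a` contribute 0.125, 0.25 and 1, which is why the most recent transaction dominates the decayed count. A factor closer to 1 keeps more history, and a smaller factor forgets it faster.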
Source
Journal of Computer Research and Development (《计算机研究与发展》)
EI
CSCD
Peking University Core Journals (北大核心)
2015, No. 12, pp. 2834-2843 (10 pages)
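Funding
National Natural Science Foundation of China (61563001)
Scientific Research Fund of the State Ethnic Affairs Commission of China (14BFZ008)
Beijing Natural Science Foundation (4142042)
Scientific Research Fund of Beifang University of Nationalities (北方民族大学) (2013QZP02)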