Mining Frequent Items in a Stream Based on Sliding Window (挖掘滑动窗口中的数据流频繁项算法)

Cited by: 2


Abstract: The sliding window is an effective technique for mining data from the most recent period of time. We propose an algorithm for mining frequent items in a data stream based on a sliding window. The algorithm adopts a linked-queue strategy, which greatly simplifies it and improves mining efficiency. Given a threshold S, an error bound ε, and a sliding-window length n, the algorithm deterministically detects the data items in the current window whose frequency exceeds Sn, with an error of less than εn. It uses O(ε⁻¹) memory, and both the processing time per data item and the query time are O(1). Building on this algorithm, we propose a general framework for mining frequent items in data streams over a sliding window: changing the framework's parameters yields different algorithms that trade time complexity against space cost. Extensive experiments show that our algorithm outperforms comparable methods in accuracy, memory requirement, and processing speed.
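To make the query in the abstract concrete, the following is a minimal sketch of an exact baseline, not the authors' algorithm. It stores all n items of the window, so it needs O(n) space rather than the O(ε⁻¹) reported for the paper's method, and it has no approximation error. The class name `ExactWindowCounter` and its methods are hypothetical names chosen for illustration.

```python
from collections import Counter, deque


class ExactWindowCounter:
    """Exact baseline (an assumption for illustration, not the paper's method):
    keeps every item of the last n stream items and their exact counts."""

    def __init__(self, n):
        if n <= 0:
            raise ValueError("window length n must be positive")
        self.n = n
        self.window = deque()    # items currently in the window, oldest first
        self.counts = Counter()  # exact frequency of each item in the window

    def add(self, item):
        """Process one arriving item; evict the oldest item once the window is full."""
        self.window.append(item)
        self.counts[item] += 1
        if len(self.window) > self.n:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def frequent_items(self, s):
        """Return the items whose frequency in the current window exceeds s * n."""
        threshold = s * self.n
        return {x for x, c in self.counts.items() if c > threshold}


stream = ["a", "b", "a", "c", "a", "b", "d", "a"]
counter = ExactWindowCounter(n=5)
for item in stream:
    counter.add(item)

# The window now holds the last five items: c, a, b, d, a.
result = counter.frequent_items(s=0.3)  # threshold is 0.3 * 5 = 1.5
print(result)
```

Unlike the O(1) query time stated in the abstract, the query here scans every distinct item in the window. An approximate method reduces space by accepting a frequency error of up to εn.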

Authors: Tu Li (屠莉), Chen Ling (陈崚), Bao Fang (包芳)

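Affiliations: Jiangsu Engineering R&D Center for Information Fusion Software; Department of Computer Science, Jiangyin Polytechnic College; Department of Computer Science and Engineering, Yangzhou University; State Key Laboratory for Novel Software Technology, Nanjing University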

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2012, No. 5, pp. 940-949 (10 pages)

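Funding: Supported by the National Natural Science Foundation of China (61070047, 61003180); the Natural Science Foundation of Jiangsu Province (BK2008206, BK2010311); the Natural Science Foundation of the Jiangsu Provincial Department of Education (09KJB20013); the Jiangsu Engineering R&D Center for Information Fusion Software (SR-2011-05); and the Jiangsu Graduate Research and Innovation Program for Universities (CX08B_098Z).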

Keywords: data stream; frequent item; sliding window

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]


