期刊文献+

挖掘滑动窗口中的数据流频繁项算法 被引量:2

Mining Frequent Items in a Stream Based on Sliding Window
下载PDF
导出
摘要 滑动窗口是一种对最近一段时间内的数据进行挖掘的有效的技术,本文提出一种基于滑动窗口的流数据频繁项挖掘算法.算法采用了链表队列策略大大简化了算法,提高了挖掘的效率.对于给定的阈值S、误差ε和窗口长度n,算法可以检测在窗口内频度超过Sn的数据流频繁项,且使误差在εn以内.算法的空间复杂度为O(ε-1),对每个数据项的处理和查询时间均为O(1).在此基础上,我们还将该算法进行了扩展,可以通过参数的变化得到不同的流数据频繁项挖掘算法,使得算法的时间和空间复杂度之间得到调节.通过大量的实验证明,本文算法比其它类似算法具有更好的精度以及时间和空间效率. The sliding window is an effective approach to mine frequent data itmes in the recent period of time.proposed an algorithm for mining frequent items in a stream based on sliding window.The algorithm adopts linked queue strategy which greatly improves the efficiency of the algorithm.Given a threshold S,an error bound ε and the length of the sliding window n,our algorithm can determinately detect the data items within the current window whose frequncy exceeds Sn with an error less than εn using O(ε-1) memory space,and the processing time for each data item and the query time are both O(1).Based on this algorithm,we have proposed a general framework for mining frequent items in data stream based on sliding window.Under this framework,different algorithms can be constructed by changing the parameters which could adjust the time complexity and space cost of the algorithm.Through extensive experiments,we show that our algorithm outperforms other methods in terms of the accuracy,memory requirement,and processing speed.
出处 《小型微型计算机系统》 CSCD 北大核心 2012年第5期940-949,共10页 Journal of Chinese Computer Systems
基金 国家自然科学基金项目(61070047 61003180)资助 江苏省自然科学基金项目(BK2008206 BK2010311)资助 江苏省教育厅自然科学基金项目(09KJB20013)资助 江苏省信息融合软件工程技术研发中心基金项目(SR-2011-05)资助 江苏省普通高校研究生科研创新计划项目(CX08B_098Z)资助
关键词 数据流 频繁项 滑动窗口 data stream frequent item sliding window
  • 相关文献

参考文献26

  • 1Manku G S,Motwani R.Approximate frequency counts over datastreams[C].In Proc.of 28th Intl.Conf on Very Large Data Ba-ses,2002:346-357.
  • 2Karp R M,Shenker S,Papadimitriou C H.A simple algorithm forfinding frequent elements in streams and bags[J].ACM Transac-tions on Database Systems,2003,28(1):51-55.
  • 3Demaine E D,Lopez-Ortiz A,Munro J I.Frequency estimation ofinternet packet streams with limited space[C].In Proceeding of the10th Annual European Symposium on Algorithms,2002:348-360.
  • 4Misra J,Gries D.Finding repeated elements[J].Science of Com-puter Programming,1982,2(2):143-152.
  • 5Flajolet P,Martin G N.Probabilistic counting algorithms for database applications[C].Journal of Computer and System Sciences,1985,31(2):182-209.
  • 6Whang K Y,Vander-Zanden B T,Taylor H M.A linear-timeprobabilistic counting algorithm for database applications[C].ACM Transactions on Databased Systems,1990,15(2):208-229.
  • 7Golab L,DeHaan D,Lopez-Ortiz A,et al.Finding frequent itemsin sliding windows with multinomially-distributed item frequencies[C].In Proceedings of the 16th International Conference on Scien-tific and Statistical Database Management,2004:425-426.
  • 8Gibbons P B,Matias Y.New sampling-based summary statistics forimproving approximate query answers[C].Proceedings of the2001 ACM SIGMOD International Conference on Management ofData,1998:331-342.
  • 9Liu Hong-yan,Liu Ying,Han Jia-wei,et al.Error-adaptive andtime-aware maintenance of frequency counts over data streams[C].Proceeding of WAIN,2006:484-495.
  • 10Estan C,Varghese G.New directions in traffic measurement andaccounting:focusing on the elephants,ignoring the mice[J].ACM Transactions on Computer System,2003,21(3):270-313.

二级参考文献13

  • 1Babcock AK,Babu S,Datar M.Model and issues in data stream systems.In:Popa L,ed.Proc.of the 21st ACM SIGACT-SIGMOD-SIGART Symp.on Principles of Database Systems.Madison:ACM,2002.1-16.
  • 2Fang M,Shivakumar N,Garcia-Molina H,Motwani R,Ullman J.Computing iceberg queries eefficiently.In:Gupta A,Shmueli O,Widom J,eds.Proc.of the 24th Int'l Conf.on Very Large Data Bases.New York:Morgan Kaufmann Publishers,1998.299-310.
  • 3Agrawal R,Srikant R.Fast algorithms for mining association rules.In:Bocca JB,Jarke M,Zaniolo C,eds.Proc.of the 20th Int'l Conf.on Very Large Data Bases.Santiago:Morgan Kaufmann Publishers,1994.487-499.
  • 4Estan C,Verghese G.New directions in traffic measurement and accounting:Focusing on the elephants,ignoring the mice.ACM Trans.on Computer Systems,2003,21(3):270-313.
  • 5Charikar M,Chen K,Farach-Colton M.Finding frequent items in data streams.In:Widmayer P,Ruiz FT,Bueno RM,Hennessy M,Eidenbenz S,Conejo R,eds.Proc.of the Int'l Colloquium on Automata,Languages and Programming.Malaga:Springer-Verlag,2002.693-703.
  • 6Cormode G,Muthukrishnan S.What's hot and what's not:Tracking most frequent items dynamically.In:Halevy AY,Ives ZG,Doan AH,eds.Proc.of the 22nd ACM SIGACT-SIGMOD-SIGART Symp.on Principles of Database Systems.San Diego:ACM Press,2003.296-306.
  • 7Jin C,Qian W,Sha C,Yu JX,Zhou A.Dynamically maintaining frequent items over a data stream.In:Carbonell J,ed.Proc.of the 2003 ACM CIKM Int'l Conf.on Information and Knowledge Management.New Orleans:ACM Press,2003.287-294.
  • 8Manku GS,Motwani R.Approximate frequency counts over data streams.In:Bernstein P,Ioannidis Y,Ramakrishnan R,eds.Proc.of the 28th Int'l Conf.on Very Large Data Bases.Hong Kong:Morgan Kaufmann Publishers,2002.346-357.
  • 9Karp R,Papadimitriou C,Shenker S.A simple algorithm for finding frequent elements in sets and bags.Trans.on Database Systems,2003,28(1):51-55.
  • 10Demaine E,López-Ortiz A,Munro JI.Frequency estimation of Internet packet streams with limited space.In:M(o)hring RH,Raman R,eds.Algorithms.ESA 2002,Proc.of the 10th Annual European Symp.Rome:Springer-Verlag,2002.348-360.

共引文献32

同被引文献11

引证文献2

二级引证文献7

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部