期刊文献+

基于计数的数据流频繁项挖掘算法 被引量:4

Algorithm Based on Counting for Mining Frequent Items over Data Stream
下载PDF
导出
摘要 挖掘数据流的频繁项已受到广泛关注,经典的频繁项挖掘算法尽管能够比较好地找到频繁项,但对频繁项频数的估计往往存在较大误差.SRoEC(segment rotative efficient count),SReEC(segment reserve efficient count)和RFreq(reserve frequent)算法针对该问题,继承基于计数的算法思想,将计数器进行划分并定义相应的操作,以期提高频数统计准确度并减小"噪音"影响.实验和数据分析表明,这些算法不仅能够保证频数超过阈值的数据项都能被找到,而且大大提高了频繁项频数统计的准确性.在同样空间代价下,算法无论在模拟数据集和真实数据集实验中,都表现出较高的频数准确率、较低的频数偏差率和较高的频数保有率,尤其是数据分布较平缓时,算法优势更加明显. Mining frequent items over data stream has drawn great attention, and large amount of efficient algorithms have been proposed by many researchers over the past decades. Although the classical algorithms are well suited to find frequent items, usually they do not perform well when estimating items' approximate frequency. To solve this problem, we introduce a series of counter- based algorithms called SRoEC (segment rotative efficient count), SReEC (segment reserve efficient count) and RFreq (reserve frequent). They divide the counter used in classical algorithms and define operations for counters to improve the accuracy of item frequency and avoid the effect of low frequency items. As the experience shows, these algorithms can find Top-K items above the threshold correctly and return their approximate frequency as accurate as possible. Both analysis and experiments demonstrate that under same cost of space, these algorithms return higher count accuracy rate, lower frequency error rate and higher frequency reserve rate on both simulated data set and real data set when compared with the two best classical algorithms (frequent algorithm and space saving algorithm) nowadays. Amongst them, RFreq algorithm shows obvious advantages. What's more, the algorithms perform much better than classical ones when the data distribution is smooth.
出处 《计算机研究与发展》 EI CSCD 北大核心 2011年第10期1803-1811,共9页 Journal of Computer Research and Development
基金 高等学校博士学科点专项科研基金项目(20090071120092) IBMCRLUR基金项目(JSA201007005)
关键词 频繁项 Top—K 数据流 数据挖掘 频数估计 words frequent item Top-K data stream data mining frequency estimation
  • 相关文献

参考文献12

  • 1Gaber M M, Zaslavsky A, Krishnaswamy S. Mining data streams: A review [J]. ACM SIGMOD Record, 2005, 34 (2) : 18-26.
  • 2Cormode G, Hadjieleftheriou M. Methods for finding frequent items in data streams [J]. The Int Journal on VLDB, 2010, 19(1):3-20.
  • 3Cormode G, Muthukrishnan S. An improved data stream summary: The count rain sketch and its applications [J]. Journal of Algorithms, 2005, 55(1): 58-75.
  • 4Manku G S, Motwani R. Approximate frequency counts over data streams [C] //Proc of the 28th Int Conf on VLDB. San Francisco: Morgan Kaufmann, 2002:346-357.
  • 5王伟平,李建中,张冬冬,郭龙江.一种有效的挖掘数据流近似频繁项算法[J].软件学报,2007,18(4):884-892. 被引量:33
  • 6Metwally A, Agrawal D, Abbadi A. Efficient computation of frequent and Top-k elements in data streams [G] //LNCS 3363. Berlin: Springer, 2005: 398-412.
  • 7Shrivastava N, guragohain C, Agrawal D, et al. Median and beyond new aggregation techniques for sensor networks [C] //Proc of the 2nd Int Conf on Embedded Networked Sensor Systems. New York: ACM, 2004:239-249.
  • 8Manerikar N, Palpanas T. Frequent items in streaming data: An experimental evaluation of the state-of-the-art [J]. Data Knowledge Engineering, 2009, 68(4): 415-430.
  • 9Gibbons P B, Matias Y. New sampling-based summary statistics for improving approximate query answers [C]// Proc of the 1998 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 1998:331-342.
  • 10Demaine E D, Lopez Ortiz A, Munro J I. Frequency estimation of internet packet streams with limited space [G] //LNCS 2461. Berlin: Springer, 2002: 11-20.

二级参考文献13

  • 1Babcock AK,Babu S,Datar M.Model and issues in data stream systems.In:Popa L,ed.Proc.of the 21st ACM SIGACT-SIGMOD-SIGART Symp.on Principles of Database Systems.Madison:ACM,2002.1-16.
  • 2Fang M,Shivakumar N,Garcia-Molina H,Motwani R,Ullman J.Computing iceberg queries eefficiently.In:Gupta A,Shmueli O,Widom J,eds.Proc.of the 24th Int'l Conf.on Very Large Data Bases.New York:Morgan Kaufmann Publishers,1998.299-310.
  • 3Agrawal R,Srikant R.Fast algorithms for mining association rules.In:Bocca JB,Jarke M,Zaniolo C,eds.Proc.of the 20th Int'l Conf.on Very Large Data Bases.Santiago:Morgan Kaufmann Publishers,1994.487-499.
  • 4Estan C,Verghese G.New directions in traffic measurement and accounting:Focusing on the elephants,ignoring the mice.ACM Trans.on Computer Systems,2003,21(3):270-313.
  • 5Charikar M,Chen K,Farach-Colton M.Finding frequent items in data streams.In:Widmayer P,Ruiz FT,Bueno RM,Hennessy M,Eidenbenz S,Conejo R,eds.Proc.of the Int'l Colloquium on Automata,Languages and Programming.Malaga:Springer-Verlag,2002.693-703.
  • 6Cormode G,Muthukrishnan S.What's hot and what's not:Tracking most frequent items dynamically.In:Halevy AY,Ives ZG,Doan AH,eds.Proc.of the 22nd ACM SIGACT-SIGMOD-SIGART Symp.on Principles of Database Systems.San Diego:ACM Press,2003.296-306.
  • 7Jin C,Qian W,Sha C,Yu JX,Zhou A.Dynamically maintaining frequent items over a data stream.In:Carbonell J,ed.Proc.of the 2003 ACM CIKM Int'l Conf.on Information and Knowledge Management.New Orleans:ACM Press,2003.287-294.
  • 8Manku GS,Motwani R.Approximate frequency counts over data streams.In:Bernstein P,Ioannidis Y,Ramakrishnan R,eds.Proc.of the 28th Int'l Conf.on Very Large Data Bases.Hong Kong:Morgan Kaufmann Publishers,2002.346-357.
  • 9Karp R,Papadimitriou C,Shenker S.A simple algorithm for finding frequent elements in sets and bags.Trans.on Database Systems,2003,28(1):51-55.
  • 10Demaine E,López-Ortiz A,Munro JI.Frequency estimation of Internet packet streams with limited space.In:M(o)hring RH,Raman R,eds.Algorithms.ESA 2002,Proc.of the 10th Annual European Symp.Rome:Springer-Verlag,2002.348-360.

共引文献32

同被引文献40

引证文献4

二级引证文献5

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部