期刊文献+

数据流上的分位数近似算法研究 被引量:3

Research on an Algorithm for Approximate Quantile Computation over Data Streams
下载PDF
导出
摘要 数据流是一种新型数据模型,广泛应用于交通流量监控、通信管理、传感器网络、股票分析、Web点击流等众多领域.近年来越来越多的学者关注于数据流上的分位数计算研究.由于流数据的连续、无界、易失等特性,存储完整的流数据信息并得到精确的查询结果几乎是不可能的.在实施查询计算时追求内存用量与查询精度之间的最佳均衡.设计了规范数直方图的概要数据结构以存储流数据的摘要信息,并在此基础上提出了单遍扫描的、联机的分位数近似算法,其时间和空间复杂度均线性于概要结构中桶的个数,而与数据流的长度无关,因而具有很好的可规模性.该方法在均匀分布的数据上取得了优良性能.分析了算法精度与内存需求的关系.实验结果表明该算法具有较精确的查询结果,具备良好的实用性和有效性. Data stream is a new data model that has attracted attentions in numerous applications such as traffic monitoring, telephone records management, sensor networks, stock-market analysis, Web click streams, etc. The importance of quantile estimation of data streams has been highlighted by more and more researchers in recent years. Due to the characteristics of continuity and boundlessness of streaming data, it is unfeasible to memorize the entire information of data streams and thus difficult to get the exact answer to the query on streaming data. In this paper, a novel synopsis data structure-Nord_Histogram for storing streaming data summary is designed to get a balance between the memory cost and the query accuracy, and a one-pass online approximate algorithm for quantile computation is presented. The algorithm implements the approximate quantile queries over data stream with the time and space requirements being linear with the number of the buckets, which has no business with the length of data streams, and therefore has great scalability. This method has very good performance on data with uniform distribution. The correlation between the computation accuracy and main memory requirement is also analyzed. Experimental results show that the algorithm brings about quite small relative errors and works well over data streams.
作者 杨蓓 黄厚宽
出处 《计算机研究与发展》 EI CSCD 北大核心 2008年第2期287-292,共6页 Journal of Computer Research and Development
基金 国家“九七三”重点基础研究发展规划基金项目(2006CB705500)
关键词 数据流 概要数据结构 直方图 分位数 近似算法 data stream synopsis data structure histogram quantile approximate algorithm
  • 相关文献

参考文献13

  • 1B Babcock, S Babu, M Datar, et al. Models and issues in data streams [C]. In: L Popa ed. Proe of the 21st ACM SIGACTSIGMOD-SIGART Syrnp on Prineiples of Database Systems. New York: ACM Press, 2002. 1-16.
  • 2P B Gibbons, Y Matias. Synopsis data structures for massive data sets [C]. The 10th Annual ACM-SIAM Symp on Discrete Algorithms, Baltimore, 1999.
  • 3J S Vitter . Random sampling with a reservoir [J]. ACM Trans on Mathematical Software, 1985, 11 (1) : 37-57.
  • 4P B Gibbons, Y Matias. New sampling-based summary statistics for improving approximate query answers [C]. In: L M Haas, A Tiwary, eds. Proc of the ACM SIGMOD Int'l Conf on Management of Data ( SIGMOD1998 ) . New York: ACM Press, 1998. 331-342.
  • 5M Charikar, K Chen, M Faraeh-Colton. Finding frequent items in data streams [J]. Theoretical Computer Science, 2004, 312 (1): 3-15.
  • 6P B Gibbons, Y Matias, V Poosala. Fast incremental maintenance of approximate histograms [C]. In: M Jarke, M J Carey, K R Dittrich, eds. Proc of the 23rd Int'l Conf on Very Large Data Bases VLDB' 97. San Francisco: Morgan Kaufmann, 1997. 466-475.
  • 7Y Ioannidis, V Poosala. Balancing histogram optimality and practicality for query result size estimation []]. S1GMOD Record, 1995, 24(2) : 233-244.
  • 8N K Govindaraju, N Raghuvanshi, D Manocha. Fast and approximate stream mining of quantiles and frequencies using graphics processors [C]. The 2005 ACM SIGMOD Int'l Conference, Baltimore, 2005.
  • 9张龙波,李战怀,闫剑锋.一种面向数据流处理的直方图增量维护算法[J].计算机工程,2005,31(14):83-84. 被引量:1
  • 10Y Matias, J Vitter, M Wang. Wavelet-based histograms for selectivity estimation [C]. The 1998 ACM SIGMOD Int'l Conf on Management of Data, Seattle, 1998.

二级参考文献8

  • 1Babcock B, Babu S, Datar M, et al. Models and Issues in Data Stream Systems. In Proc. 21^st ACM SIGACT-SIGMODSIGART Symp. on Principles of Database Systems, Madison,Wisconsin, 2002-05: 1-16
  • 2Babu S, Widom J. Continuous Queries over Data Streams. SIGMOD Record, 2001, 30(3): 109-120
  • 3Guha S, Koudas N. Approximating a Data Stream for Querying and Estimation: Algorithm Performance Evaluation, ICDE'02, 2002
  • 4韩近强 杨东青 唐世渭.数据流处理中一种自适应的直方图维护算法[A]..第20届全国数据库学术会议论文集A集[C].长沙,2003-10.178-181.
  • 5Ioanidis Y.The History of Histogram (Abridged). Berlin, Germany, Proceedings of the 29^th VLDB Conference, 2003
  • 6Jagadish H V, Koudas N, Muthukrishnan S, et al. Optimal Histograms with Quality Guarantees. Proceedings of VLDB, 1998-08: 275
  • 7Guha S, Koudas N, Shim K. Data Streams and Histograms. Symposiumon the Theory of Computing (STOC), 2001-07
  • 8吴胜利.估算查询结果大小的直方图方法之研究[J].软件学报,1998,9(4):285-289. 被引量:16

同被引文献21

  • 1金澈清,钱卫宁,周傲英.流数据分析与管理综述[J].软件学报,2004,15(8):1172-1181. 被引量:161
  • 2吴恩华.图形处理器用于通用计算的技术、现状及其挑战[J].软件学报,2004,15(10):1493-1504. 被引量:141
  • 3曹锋,周傲英.基于图形处理器的数据流快速聚类[J].软件学报,2007,18(2):291-302. 被引量:24
  • 4BABCOCK B, BABU S, DATAR M, et al. Models and issues in data stream systems[ C]//PODS 2002: Proceedings of 21st ACM Symposiurm on Principles of Database Systems. New York: ACM, 2002:1 - 16.
  • 5BABCOCK B, BABU S, DATAR M, et al. Operator scheduling in data stream systems[ J]. The VLDB Journal, 2004, 12(13) : 333 - 353.
  • 6ZHUYUN - YUE, SHASHA D. StatStream : Statistical monitoring of thousands of data streams in real time[ C]// Proceedings of the 28th VLDB Conference. Hong Kong: VLDB Endowment, 21302:358 -369.
  • 7GOLAB L, GARG S, TAMEROZSU M. On indexing sliding windows over online data streams [ C]// EDBT 2004, LNCS 2992. Berlin: Springer-Verlag, 2004:712-729.
  • 8GOVINDARAJU N K, RAGHUVANSHI N, MANOCHA D. Fast and approximate stream mining of quantiles and frequencies using graphics processors[ C]//Proceedings of ACM SIGMOD 2005. New York: ACM, 2005:611-622.
  • 9Nvidia. NVIDIA CUDA programming guide[ EB/OL]. (2008 - 06) [21308- 06 -07]. http://developer, download, nvidia, com/compute,/ cuda/2_0/NVIDIA_ CUDA_Programming_Guide_2.0. pdf.
  • 10DOTSENKO Y, GOVINDARAJU N K, SLOAN P-P, et al. Fast scan algorithms on graphics processors [ C]// Proceedings of ACM International Conference on Supercomputing. New York: ACM, 2006:205-214.

引证文献3

二级引证文献3

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部