Study of Sampling Methods on Data Mining and Stream Mining (数据挖掘取样方法研究)

Cited by: 54
Abstract: Sampling is a general and effective approximation technique. In data mining, sampling can sharply reduce the size of the data set to be processed, which allows many data mining algorithms to be applied to large-scale data sets and to data streams. Based on a comparative review of representative sampling methods used in data mining, including reservoir sampling, concise sampling, count sampling, chain sampling, and DV sampling, this paper proposes a taxonomy of sampling algorithms organized around uniform sampling and biased sampling. Uniform sampling has limitations in some applications, such as queries with relatively low selectivity, outlier detection in large multidimensional data sets, and clustering over data streams with a skewed Zipf distribution; the paper explains why biased sampling methods are needed in these scenarios. It then surveys the application and development of sampling techniques in data mining, statistics estimation, and stream mining, with particular attention to classic techniques such as progressive sampling, adaptive sampling, stratified sampling, and two-phase sampling. Finally, it discusses the challenges and future directions of sampling for data stream mining.
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list, 2011, Issue 1, pp. 45-54 (10 pages)
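Funding: National Natural Science Foundation of China (60873176); Science and Technology Project of the Fujian Provincial Department of Education (JA08161)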
Keywords: data mining; uniform sampling; biased sampling; data stream; synopsis data structure
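
Reservoir sampling is one of the uniform sampling methods the abstract names for data streams. The following is a minimal sketch of the classic single-pass variant (often called Algorithm R), not the paper's own code; the function name `reservoir_sample` and its parameters `stream`, `k`, and `rng` are illustrative assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Draw k items uniformly at random from an iterable of unknown length.

    Hypothetical helper for illustration: one pass, O(k) memory.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i+1 replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)  # randint is inclusive at both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 values from a simulated stream of 10,000 readings.
    print(reservoir_sample(range(10_000), 5, random.Random(42)))
```

Every element ends up in the sample with the same probability, which is exactly the property that makes uniform sampling a poor fit for the skewed workloads the abstract describes, where rare but important items are likely to be missed.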