
Density biased sampling algorithm based on variable grid division (cited by: 7)
Abstract: Simple random sampling is the most commonly used data reduction method for analyzing and processing large-scale datasets, but it tends to lose clusters when the dataset is unevenly distributed. Density biased sampling based on fixed grid division addresses this problem effectively, yet its speed and sample quality are sensitive to the granularity of the grid division. To overcome this shortcoming, a density biased sampling algorithm based on variable grid division is proposed. The division granularity of each dimension is determined from the distribution of the original dataset along that dimension, so that the constructed grid space matches the distribution of the original dataset. Experimental results show that density biased sampling on a variable grid yields samples of clearly higher quality and also takes less sampling time than density biased sampling based on fixed grid division.
Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal (北大核心), 2013, Issue 9, pp. 2419-2422 (4 pages)
Funding: National Natural Science Foundation of China (61103129, 61202312); Jiangsu Provincial Science and Technology Support Program (BE2009009)
Keywords: density biased sampling; variable grid division; data mining; large-scale dataset; clustering
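The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: this record only carries the abstract, so the rule used here to pick each dimension's granularity (a Freedman-Diaconis style bin width) is an assumed stand-in, and the inclusion probability alpha / n^e is a commonly described form of density biased sampling rather than a detail confirmed by this page. All function names and the parameters `e` and `max_bins` are hypothetical.

```python
import math
import random
from collections import defaultdict


def bins_for_dimension(values, max_bins=64):
    """Assumed per-dimension rule: Freedman-Diaconis style bin count."""
    n = len(values)
    s = sorted(values)
    lo, hi = s[0], s[-1]
    iqr = s[(3 * n) // 4] - s[n // 4]
    if hi == lo or iqr <= 0:
        return 1  # degenerate spread: keep the dimension undivided
    width = 2 * iqr / n ** (1 / 3)
    return max(1, min(max_bins, math.ceil((hi - lo) / width)))


def build_grid(points, max_bins=64):
    """Assign each point index to a cell of a grid whose granularity
    is chosen separately for every dimension."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    bins = [bins_for_dimension(c, max_bins) for c in columns]
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = []
        for d, x in enumerate(p):
            span = highs[d] - lows[d]
            k = 0 if span == 0 else int((x - lows[d]) / span * bins[d])
            key.append(min(bins[d] - 1, k))  # clamp the maximum value
        cells[tuple(key)].append(idx)
    return cells, bins


def inclusion_probabilities(cell_sizes, sample_size, e):
    """Per-cell probability alpha / n**e, clipped to 1.
    e = 0 gives uniform sampling; larger e favours sparse cells."""
    alpha = sample_size / sum(n ** (1 - e) for n in cell_sizes)
    return [min(1.0, alpha / n ** e) for n in cell_sizes]


def density_biased_sample(points, sample_size, e=0.5, rng=None, max_bins=64):
    """Return indices of sampled points (expected count near sample_size)."""
    rng = rng or random.Random()
    cells, _ = build_grid(points, max_bins)
    members = list(cells.values())
    probs = inclusion_probabilities([len(m) for m in members], sample_size, e)
    sample = []
    for group, prob in zip(members, probs):
        sample.extend(i for i in group if rng.random() < prob)
    return sample
```

With `e = 0` every point has the same inclusion probability, which reduces to simple random sampling; raising `e` toward 1 shifts probability mass to sparse cells, which is how small clusters are kept from disappearing. When probabilities are clipped to 1 the expected sample size falls below `sample_size`.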

