Abstract
To address the difficulty of clustering large-scale data sets, this paper analyzes the advantages of sampling techniques, examines the characteristics of random sampling in data mining, and on that basis proposes a density-based biased sampling method. The sample obtained by density biased sampling reflects the characteristics of the whole data set relatively accurately, and the sampling rate can be controlled flexibly for different regions of the data set. Experimental results show that, when clustering large-scale data sets, density biased sampling outperforms random sampling in time complexity.
Source
《计算机科学》
CSCD
Peking University Core Journal (北大核心)
2009, Issue 2, pp. 207-209, 264 (4 pages in total)
Computer Science
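Funding
Key Program of the National Natural Science Foundation of China (70031010)
One of the research papers of the 985 Project Innovation Base for Philosophy and Social Sciences
Supported by the "Program for New Century Excellent Talents"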
Keywords
Data mining, Clustering, Biased sampling, Random sampling