摘要
不确定性数据聚类方法的研究日益受到广泛关注,其中UIDK-means算法与U-PAM算法继承了基于划分算法无法识别任意形状簇和对噪声点敏感的缺陷。FDBSCAN算法事先假定不确定性数据的概率分布函数或概率密度函数是已知的,然而这些信息在实际应用中往往难以获取。针对上述算法的不足,提出一种基于区间数的多维不确定性数据聚类UID-DBSCAN算法。该算法利用区间数结合数据的统计信息合理地表示不确定性数据,采用低计算复杂度的区间数距离函数衡量不确定性数据对象间的相似度,首次提出区间数的密度、密度可达与密度相连等概念,并将其用于扩展簇中,同时结合数据集的统计特征自适应地选取算法的密度参数来实现自动聚类。实验结果表明,UID-DBSCAN算法能够有效识别噪声,处理任意形状簇,具有较高的聚类精度和较低的计算复杂度。
The researches on clustering methods of uncertain data have been paid more and more attention,among them,the UIDK-means algorithm and U-PAM algorithm inherit the partition-based algorithm defects that can not identify any shape clusters and is sensitive to noise.FDBSCAN algorithm assumes that the probability distribution function or probability density function of uncertain data is known,however this information is hard to acquire.For the shortage of the above algorithms,a new multi-dimensional uncertain data clustering algorithm namely UID-DBSCAN based on interval numbers was proposed.It uses interval data combined with statistic information to describe uncertain data reasonably.And it utilizes the intervals distance function of low computing complexity to measure the similarity of different uncertain data.The concepts of interval density,interval density-reachable and interval density connected were firstly proposed and applied to expand clusters.Meanwhile in order to realize automatic clustering,combining with statistical features of the data,the parameters of density can be adaptively selected.Experiment results show that UID-DBSCAN algorithm can identify noise effectively,process arbitrary shape clusters and obtain better clustering precision with low computing complexity.
出处
《计算机科学》
CSCD
北大核心
2017年第B11期442-447,共6页
Computer Science
基金
水利部公益性行业科研专项(201401044)资助