Abstract
With the rapid growth in the scale and compositional complexity of real-world data, accurately estimating the number of clusters and the cluster centers is a major challenge for clustering algorithms that process and analyze complex, large-scale data. Accurate estimation of the cluster number and cluster centers is crucial for some parametric clustering algorithms, for measuring the overall complexity of a dataset, and for simplified data representation. In this paper, building on an in-depth analysis of I-nice, a multi-observation-point I-nice clustering algorithm based on candidate center fusion (I-niceCF) is proposed. Within the original multi-observation-point projection divide-and-conquer framework, the Gaussian mixture model (GMM) is combined with a coarse-to-fine search strategy for the optimal GMM to partition the data subsets accurately. In addition, a GMM component vector is constructed for each candidate center from its distances to the observation points and the optimal GMMs, and a Minkowski distance pair is designed to measure the dissimilarity between candidate centers. The candidate centers are then fused according to the dissimilarity of their GMM component vectors. Unlike existing clustering algorithms, I-niceCF jointly optimizes the two key steps of the divide-and-conquer process, data subset partitioning and candidate center integration, and thereby achieves accurate and efficient estimation of hundreds to thousands of clusters. A series of experiments on real and synthetic datasets shows that I-niceCF estimates the cluster number and cluster centers accurately and achieves high clustering accuracy. The experiments also verify its effectiveness and its stability across various data scenarios.
Authors
陈鸿杰
何玉林
黄哲学
尹剑飞
CHEN Hongjie; HE Yulin; HUANG Zhexue; YIN Jianfei (Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060; National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060)
Source
《模式识别与人工智能》
EI
CSCD
Peking University Core Journal
2022, Issue 4, pp. 348-362 (15 pages)
Pattern Recognition and Artificial Intelligence
Funding
National Natural Science Foundation of China, General Program (No. 61972261)
Shenzhen Basic Research Program (No. JCYJ20210324093609026, JCYJ20200813091134001)
Keywords
Unsupervised Learning
Observation Point
I-nice
Parameter-Free Clustering
Gaussian Mixture Model