Abstract
Solving many practical problems requires not only the cluster labels produced by a clustering algorithm but also the ability to discern how far apart or how close the clusters are. In the difficult case of high-dimensional data with many clusters, clustering visualization methods based on dimension reduction often show clusters that overlap, interlace, or are artificially pushed apart, so the far-near relations of some clusters cannot be distinguished or are displayed incorrectly. Existing inter-cluster distance methods, in turn, cannot reveal whether two clusters are far apart or close together. This paper proposes the geometric double-entity model (GDEM) method to describe the relation between two clusters, and designs measures of the far-near degree between clusters, including the relative border distance, the absolute border distance, and the region dense degree. GDEM considers both the absolute distance between the nearest sample sets of two clusters and the dense degree of their border regions; its advantage is that it can accurately reveal inter-cluster relations in high-dimensional space even under the difficult condition described above. Experimental results on four real data sets show that GDEM can effectively recognize far-near relations of clusters that existing clustering visualization methods cannot distinguish.
Authors
WANG KaiJun
YAN XuanHui
CHEN LiFei
School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350108, China
Source
Scientia Sinica (Informationis) (《中国科学:信息科学》)
CSCD
2012, Issue 1, pp. 99-110 (12 pages)
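Funding
Class A Project of the Fujian Provincial Department of Education (Grant No. JA09043)
Special Research Project for Fujian Provincial Universities (Grant No. JK2009006)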
Keywords
geometric double-entity model
far-near relations of clusters
many clusters
high-dimensional data
partitional clustering algorithms