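Abstract
Current popular clustering ensemble algorithms cannot provide an appropriate processing scheme for the differing characteristics of different data sets. To address this, a new enhanced clustering ensemble algorithm based on data set characteristics is proposed. The algorithm consists of base clustering generation, base clustering selection, and a consensus function. Guided by the characteristics of the data set, it uses a heuristic method to select suitable base clusterings, builds the final set of base clusterings, and produces the final clustering result. In the experiments, three benchmark data sets (ecoli, leukaemia and Vehicle) were clustered, and the clustering errors of the proposed algorithm were 0.014, 0.489 and 0.479 respectively. Compared with the Bagging based Structure Ensemble Approach (BSEA), Hybrid Cluster Ensemble (HCE) and Cluster-Oriented Ensemble Classifier (COEC) algorithms, the proposed algorithm always had the lowest clustering error, and as the number of candidate base clusterings increased, its Normalized Mutual Information (NMI) values were always higher than those of the compared algorithms. The experimental results show that, among the clustering ensemble algorithms compared, the proposed algorithm has the highest clustering accuracy and the strongest scalability.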
Popular clustering ensemble algorithms cannot provide an appropriate processing scheme for the differing characteristics of different data sets. To overcome this defect, a new clustering ensemble algorithm, the Enhanced Clustering Ensemble algorithm based on Characteristics of Data sets (ECECD), is proposed. ECECD is composed of base clustering generation, base clustering selection, and a consensus function. Based on the characteristics of the data set, it selects a particular range of ensemble members to form the final ensemble and produces the final clustering. Three benchmark data sets (ecoli, leukaemia and Vehicle) were clustered in the experiments, and the clustering errors obtained by the proposed algorithm were 0.014, 0.489 and 0.361 respectively, always the lowest when compared with those of the Bagging based Structure Ensemble Approach (BSEA), Hybrid Cluster Ensemble (HCE) and Cluster-Oriented Ensemble Classifier (COEC). The Normalized Mutual Information (NMI) values of the proposed algorithm were also always higher than those of these algorithms as the number of candidate base clusterings increased. Therefore, compared with these popular clustering ensemble algorithms, the proposed algorithm has the highest clustering precision and the strongest scalability.
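The abstract reports Normalized Mutual Information (NMI) as one of its two evaluation measures. As a reading aid, the following is a minimal sketch of how NMI between two label assignments can be computed. It assumes normalization by the geometric mean of the two entropies, which is common in the clustering ensemble literature; the abstract does not state which normalization the paper uses. The function names `entropy`, `mutual_information` and `nmi` are illustrative and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two label assignments."""
    n = len(a)
    count_a = Counter(a)
    count_b = Counter(b)
    joint = Counter(zip(a, b))
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) )
        mi += (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
    return mi


def nmi(a, b):
    """NMI with geometric-mean normalization (an assumption, see lead-in)."""
    if len(a) != len(b):
        raise ValueError("label lists must have the same length")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        # Convention assumed here: two single-cluster labelings agree
        # perfectly; otherwise no information is shared.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / math.sqrt(h_a * h_b)


if __name__ == "__main__":
    # Same partition under renamed cluster labels, then an independent one.
    print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
    print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because NMI depends only on how the points are grouped, it is unaffected by how the cluster labels are named, which makes it suitable for comparing a clustering result with reference classes.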
Source
Journal of Computer Applications (《计算机应用》)
CSCD
PKU Core Journals (北大核心)
2013, No. 8, pp. 2204-2207, 2249 (5 pages in total)
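Funding
Shandong Province Enterprise Training and Staff Education Research Project (2012-277)
Weifang Social Science Planning Key Project (潍社科学术委发[2011]2号)
Shandong Province Higher Education Humanities and Social Sciences Research Program (J08WG71)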
Keywords
base clustering (基聚类器)
consensus function (共识函数)
clustering ensemble algorithm (聚类集成算法)
clustering error (聚类误差)
adaptivity (自适应性)
Normalized Mutual Information (NMI) (标准化互信息)