Abstract
When applied to imbalanced data sets, the information bottleneck (IB) method tends to assign data objects from large clusters to smaller ones, producing clusters of relatively uniform size and degrading clustering quality. To address this problem, this paper proposes a multi-clusters information bottleneck (McIB) algorithm for imbalanced data. McIB uses under-sampling to reduce the skewness of the imbalanced data set, and adopts a divide-first, learn-next, merge-last strategy to improve how the IB algorithm merges and extracts clusters on such data. The algorithm consists of three steps. First, a separation criterion is used to determine the sampling ratio parameter. Second, a preliminary clustering of the data generates multiple reliable clusters. Finally, clusters are merged according to the similarity between them, so that several clusters together represent each actual cluster in the final clustering result. Experimental results show that the proposed algorithm effectively resolves the "uniform effect" of the IB method on imbalanced data sets, and that McIB outperforms the other clustering algorithms it was compared with.
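The merging step can be illustrated with a small sketch. The abstract does not state which similarity measure McIB uses, so the code below assumes the merge cost of the standard agglomerative IB procedure: the combined cluster prior multiplied by the prior-weighted Jensen-Shannon divergence between the two clusters' conditional distributions p(y|t). The function names `kl`, `merge_cost`, and `merge_clusters` are hypothetical, and the greedy loop is only an illustration of "merge the most similar clusters until k remain", not the authors' implementation.

```python
import numpy as np


def kl(p, q):
    """KL divergence D(p || q) in nats; terms where p == 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def merge_cost(prior_i, prior_j, cond_i, cond_j):
    """Assumed similarity: information lost by merging clusters i and j.

    cost = (p(t_i) + p(t_j)) * JS_w(p(y|t_i), p(y|t_j)), with weights w
    proportional to the cluster priors (agglomerative IB merge cost).
    Priors are assumed to be strictly positive.
    """
    total = prior_i + prior_j
    w_i, w_j = prior_i / total, prior_j / total
    mix = w_i * cond_i + w_j * cond_j
    js = w_i * kl(cond_i, mix) + w_j * kl(cond_j, mix)
    return total * js


def merge_clusters(priors, conds, k):
    """Greedily merge the cheapest pair of clusters until k clusters remain."""
    priors = [float(p) for p in priors]
    conds = [np.asarray(c, dtype=float) for c in conds]
    members = [[i] for i in range(len(priors))]
    while len(priors) > k:
        best = None
        for i in range(len(priors)):
            for j in range(i + 1, len(priors)):
                cost = merge_cost(priors[i], priors[j], conds[i], conds[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        total = priors[i] + priors[j]
        # The merged cluster's p(y|t) is the prior-weighted mixture.
        conds[i] = (priors[i] * conds[i] + priors[j] * conds[j]) / total
        priors[i] = total
        members[i] = members[i] + members[j]
        del priors[j], conds[j], members[j]
    return members, priors, conds


if __name__ == "__main__":
    # Three preliminary sub-clusters; the first two have similar p(y|t).
    priors = [0.4, 0.4, 0.2]
    conds = [[0.8, 0.2], [0.75, 0.25], [0.1, 0.9]]
    members, merged_priors, _ = merge_clusters(priors, conds, k=2)
    print(members, merged_priors)
```

In this toy input, the two sub-clusters with nearly identical conditional distributions are the cheapest pair to merge, which mirrors the idea of letting several small clusters jointly represent one actual large cluster.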
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2016, Issue 7, pp. 245-250 (6 pages)
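Funding
National Natural Science Foundation of China project: Research on Multivariate IB Methods and Algorithms (61170223)
National Natural Science Foundation of China Joint Fund project: Research on Automatic Mapping of Cross-Media Complex Problems in Scalable Transfer Learning (U1204610)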
Keywords
Clustering
Information bottleneck (IB) algorithm
Imbalanced data
Multi-clusters
Cluster merging