
Multi-Cluster IB Algorithm for Imbalanced Data Sets

Cited by: 2
Abstract: When applied to imbalanced data sets, the Information Bottleneck (IB) method tends to assign data objects from large clusters to smaller ones, producing clusters of relatively uniform size (the "uniform effect") and degrading clustering quality. To address this problem, this paper proposes a multi-cluster information bottleneck (McIB) algorithm. McIB uses under-sampling to reduce the skewness of imbalanced data sets, and follows a strategy of partitioning first, then learning, then merging to improve the way IB merges and extracts clusters from imbalanced data. The algorithm has three steps. First, a separation criterion is used to determine the sampling ratio parameter. Second, a preliminary clustering of the data generates multiple reliable clusters. Third, clusters are merged according to their similarity, so that several clusters together represent each actual cluster in the final clustering result. Experimental results show that McIB effectively resolves the uniform effect of the IB method on imbalanced data sets, and that it outperforms the other clustering algorithms it was compared with.
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2016, No. 7, pp. 245-250 (6 pages)
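Funding: Supported by the National Natural Science Foundation of China project "Research on Multivariate IB Methods and Algorithms" (61170223) and the National Natural Science Foundation of China Joint Fund project "Research on Automatic Mapping of Cross-Media Complex Problems in Scalable Transfer Learning" (U1204610).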
Keywords: clustering, information bottleneck (IB) algorithm, imbalanced data, multi-clusters, cluster merging
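The abstract's final step, merging many small reliable clusters into the actual clusters by similarity, can be illustrated with a minimal sketch. The abstract does not state which similarity measure McIB uses; the code below assumes the weighted Jensen-Shannon merge cost familiar from agglomerative IB, and the names `kl`, `merge_cost`, and `merge_clusters`, as well as the toy data, are hypothetical. The sampling-ratio step and the preliminary clustering are not reproduced.

```python
import math
from itertools import combinations


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; zero-probability terms contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def merge_cost(wa, pa, wb, pb):
    """Weighted Jensen-Shannon cost of merging two clusters (assumed criterion).

    wa, wb are the cluster weights p(t); pa, pb are the distributions p(y|t).
    Returns the cost and the distribution of the merged cluster.
    """
    w = wa + wb
    mix = [(wa * a + wb * b) / w for a, b in zip(pa, pb)]
    return wa * kl(pa, mix) + wb * kl(pb, mix), mix


def merge_clusters(clusters, k):
    """Greedily merge the cheapest pair until k clusters remain.

    clusters: list of (members, weight, distribution) tuples.
    """
    clusters = list(clusters)
    while len(clusters) > k:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            cost, mix = merge_cost(clusters[i][1], clusters[i][2],
                                   clusters[j][1], clusters[j][2])
            if best is None or cost < best[0]:
                best = (cost, i, j, mix)
        _, i, j, mix = best
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1], mix)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


# Toy input: three small clusters carved out of one large group, plus one
# genuinely small group with a very different distribution.
small_clusters = [
    (["a1"], 0.3, [0.80, 0.10, 0.10]),
    (["a2"], 0.3, [0.70, 0.20, 0.10]),
    (["a3"], 0.3, [0.75, 0.15, 0.10]),
    (["b1"], 0.1, [0.10, 0.10, 0.80]),
]
result = merge_clusters(small_clusters, 2)
for members, weight, _ in result:
    print(sorted(members), round(weight, 2))
```

In this toy case the three pieces of the large group are merged back together while the small group stays separate, which is the behavior the abstract describes as avoiding the uniform effect.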
