
An Attribute Weighted Clustering Algorithm for Mixed Data Based on Information Entropy (cited by: 42)
Abstract Mixed data, which contain both numerical and categorical attributes, are common in real applications, and clustering analysis for mixed data has attracted growing attention. To address attribute weighting in the clustering of high-dimensional mixed data, this paper proposes an attribute weighted clustering algorithm for mixed data based on information entropy, with the aim of improving pattern discovery. First, an extended Euclidean distance is defined for mixed data so that the difference between an object and a cluster can be measured more accurately and objectively. Second, within the information entropy framework, within-cluster entropy and between-cluster entropy are used to give a unified measure of the compactness of a cluster and of its separation from the other clusters. A measure of attribute importance is derived from this mechanism, and an entropy-based attribute weighted clustering algorithm for mixed data is then developed. Experiments on ten UCI datasets show that the proposed algorithm outperforms traditional unweighted clustering algorithms and existing attribute weighted clustering algorithms under four clustering evaluation indices, and a statistical significance test shows that its clustering results differ significantly from those of the existing algorithms.
Source Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list, 2016, Issue 5, pp. 1018-1028 (11 pages)
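Funding National Natural Science Foundation of China (61432011, U1435212, 61402272); National Basic Research Program of China (973 Program) (2013CB329404); Natural Science Foundation of Shanxi Province (2013021018-1)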
Keywords clustering analysis; mixed data; attribute weighting; information entropy; dissimilarity measure
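
The abstract names two building blocks: an attribute weighted distance between an object and a cluster of mixed data, and entropy as the basis for attribute importance. The following is a minimal sketch of that general idea in Python, not the paper's actual formulas, which the abstract does not give. The function names, the frequency-based treatment of categorical attributes, and the way the weights are passed in are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mixed_distance(obj, cluster, numeric_idx, categorical_idx, weights):
    """Weighted squared dissimilarity between one object and a cluster.

    Assumed form (hypothetical, for illustration only):
      numeric attribute j     -> (obj[j] - cluster mean of j) ** 2
      categorical attribute j -> (1 - relative frequency of obj[j] in cluster) ** 2
    Each term is multiplied by the attribute weight weights[j].
    """
    n = len(cluster)
    total = 0.0
    for j in numeric_idx:
        mean_j = sum(row[j] for row in cluster) / n
        total += weights[j] * (obj[j] - mean_j) ** 2
    for j in categorical_idx:
        freq = sum(1 for row in cluster if row[j] == obj[j]) / n
        total += weights[j] * (1.0 - freq) ** 2
    return total


# Toy cluster with one numeric and one categorical attribute.
cluster = [(1.0, "x"), (3.0, "x"), (2.0, "y")]
obj = (2.0, "x")
d = mixed_distance(obj, cluster, numeric_idx=[0], categorical_idx=[1],
                   weights=[0.5, 0.5])
h = shannon_entropy([row[1] for row in cluster])
```

In this sketch the weights are supplied by the caller. In an entropy-based scheme of the kind the abstract describes, they would instead be derived from within-cluster and between-cluster entropy per attribute.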