
A Hierarchical Clustering Analysis Algorithm of Categorical Data Based on Mean of Similarity (cited by: 10)
Abstract: Hierarchical clustering analysis is a widely used unsupervised learning technique in data mining and machine learning. However, hierarchical clustering algorithms rely mainly on a manually set similarity threshold to decide when clusters are merged or split, and this threshold is difficult to set without prior knowledge. Using the mean of similarity together with a boundary data object allocation strategy, this paper proposes a hierarchical clustering analysis algorithm for categorical data based on the mean of similarity. The algorithm uses the mean of similarity to characterize the central tendency of the distribution of data objects in a data set, along with a steady similarity measure, as an important basis for merging or splitting clusters. A formula for computing the mean of similarity is given, so that the similarity threshold can be determined automatically, which removes the need to set this parameter manually. A boundary data object allocation strategy based on the mean of similarity is also given, which improves the accuracy of boundary data object allocation and the clustering quality. Experiments on UCI and synthetic data sets verify that the algorithm has good clustering performance and noise resistance, and that the mean of similarity is stable and effective.
Authors: CHU Ke-xin; XUN Ya-ling (School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China)
Source: Computer Technology and Development, 2022, No. 11, pp. 154-163 (10 pages)
Funding: National Natural Science Foundation of China (61602335); Natural Science Foundation of Shanxi Province (201901D211302)
Keywords: hierarchical clustering; categorical data; mean of similarity; steady similarity measure; allocation strategy
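
The abstract's central idea, using the mean of similarity as an automatically determined threshold for merging clusters, can be illustrated with a minimal sketch. This is not the authors' algorithm: the record does not give their formula for the mean of similarity, their steady similarity measure, or their boundary data object allocation strategy. The sketch assumes simple matching similarity between categorical records, takes the plain average of all pairwise similarities as the threshold, and merges clusters by average linkage. All function names and the toy data are hypothetical.

```python
from itertools import combinations

def simple_matching(a, b):
    """Fraction of attributes on which two categorical records agree (assumed measure)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def mean_similarity(data):
    """Plain average of all pairwise similarities; stands in for the paper's formula."""
    pairs = list(combinations(range(len(data)), 2))
    return sum(simple_matching(data[i], data[j]) for i, j in pairs) / len(pairs)

def cluster_similarity(data, c1, c2):
    """Average-link similarity between two clusters given as lists of record indices."""
    total = sum(simple_matching(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(data):
    """Merge the most similar pair of clusters until no pair exceeds the threshold."""
    threshold = mean_similarity(data)  # no user-supplied threshold
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > 1:
        best, p, q = max(
            (cluster_similarity(data, clusters[p], clusters[q]), p, q)
            for p, q in combinations(range(len(clusters)), 2)
        )
        if best <= threshold:
            break
        clusters[p] = clusters[p] + clusters[q]  # p < q, so deleting q is safe
        del clusters[q]
    return clusters, threshold

# Hypothetical toy data: six records with three categorical attributes each.
records = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
    ("green", "large", "square"),
]
clusters, threshold = agglomerate(records)
print(sorted(sorted(c) for c in clusters), round(threshold, 3))
```

On this toy data the threshold works out to 14/45 (about 0.311), and merging stops with two clusters, records 0-2 and records 3-5, because the average similarity between those two groups (4/27) falls below the threshold. The stopping rule is the only place the mean of similarity is used here; the paper additionally applies it to boundary data object allocation, which this sketch does not attempt.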