A clustering algorithm for categorical variables based on connected components

Cited by: 4


Abstract: To address shortcomings in existing similarity definitions for categorical variables, a new similarity definition is proposed. Using this definition, the data set is modeled as an undirected graph, and clustering is reduced to finding the connected components of that graph. This yields a clustering algorithm for categorical variables based on connected components. To analyze the clustering results quantitatively, a new evaluation index is proposed for data sets whose class labels are known. Experimental results show that the proposed algorithm achieves higher clustering precision and faster execution than several existing algorithms.
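The pipeline summarized in the abstract (similarity, then undirected graph, then connected components) can be illustrated with a minimal Python sketch. The abstract does not give the paper's new similarity definition, so this sketch substitutes a plain simple-matching similarity and a fixed threshold; both, along with the names `simple_matching`, `cluster_by_components`, and `threshold`, are assumptions made for illustration and are not the authors' method.

```python
from itertools import combinations


def simple_matching(a, b):
    """Fraction of attributes on which two records agree.

    Assumed stand-in: the paper's own similarity definition is not
    given in the abstract.
    """
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_by_components(records, threshold, similarity=simple_matching):
    """Join records whose similarity is >= threshold with an edge and
    return the connected components of the resulting undirected graph
    as sorted lists of record indices."""
    parent = list(range(len(records)))  # union-find forest, one node per record

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Each sufficiently similar pair is an edge; merge its endpoints.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    # Group record indices by the root of their component.
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical toy data: three categorical attributes per record.
records = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "oval"),
    ("blue", "large", "square"),
    ("blue", "medium", "square"),
]
print(cluster_by_components(records, threshold=2 / 3))
```

In this sketch the number of clusters is not fixed in advance. It follows from the threshold: raising it removes edges and splits components, while lowering it merges them. The pairwise comparison is quadratic in the number of records, which is acceptable for a toy example but says nothing about the efficiency reported in the paper.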

Authors: Zhou Hongfang (周红芳), Zhou Yang (周扬), Zhang Xiaopeng (张晓鹏), Tan Shuchen (谈姝辰)

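Affiliations: School of Computer Science and Engineering, Xi'an University of Technology; Shaanxi Applied Physics and Chemistry Research Institute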

Source: Control and Decision (《控制与决策》), indexed in EI, CSCD, and the Peking University Core Journals list; 2015, No. 1, pp. 39-45 (7 pages)

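Funding: National Natural Science Foundation of China (61402363, 61272284); Shaanxi Provincial Industrial Research Project (2014K05-49); Shaanxi Provincial Natural Science Basic Research Program (2014JQ8361); Xi'an Beilin District Science and Technology Program (GX1405); Xi'an Science Program (CXY1339(5)); University Characteristic Research Program (116-211302)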

Keywords: clustering; categorical variables; similarity; connected components; clustering precision

Classification code: TP311.13 [Automation and Computer Technology: Computer Software and Theory]


