Novel semi-supervised clustering algorithm based on active data selection (基于主动数据选取的半监督聚类算法)


Abstract: Semi-supervised clustering, which uses a small amount of labeled data to obtain high clustering accuracy, has been a research focus in data mining and machine learning in recent years. However, existing semi-supervised clustering algorithms achieve relatively low accuracy when very few labeled data are available or when the dataset is multi-density and unbalanced. This paper studies the selection of labeled data using active learning and proposes a new semi-supervised clustering algorithm. The algorithm combines minimum spanning tree clustering with active learning to select information-rich data points as labeled data, and then propagates class labels using a KNN-like approach. Tests on UCI standard datasets and synthetic datasets show that the proposed algorithm produces more accurate and more stable clustering results than other algorithms on multi-density and unbalanced datasets.
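The abstract names three ingredients (minimum spanning tree clustering, active selection of points to label, and KNN-like label propagation) but gives no algorithmic detail. The following is a minimal sketch of how such a pipeline could fit together, not the authors' method. Every function name here is hypothetical. Cutting the longest MST edges, querying the medoid of each component, and propagating with a single nearest neighbour are all assumptions chosen for illustration.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (length, i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known link from the tree to each point
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def split_groups(n, edges, n_groups):
    """Cut the n_groups - 1 longest MST edges and return the resulting components."""
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]   # path halving
            x = root[x]
        return x

    for _, i, j in sorted(edges)[: len(edges) - (n_groups - 1)]:
        root[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def select_and_propagate(points, n_groups, oracle):
    """Query one label per MST component, then spread labels to nearest neighbours."""
    n = len(points)
    labels = [None] * n
    queried = []
    for group in split_groups(n, mst_edges(points), n_groups):
        # Assumed selection rule: treat the medoid as the most informative point.
        medoid = min(group, key=lambda i: sum(math.dist(points[i], points[j]) for j in group))
        labels[medoid] = oracle(medoid)
        queried.append(medoid)
    unlabeled = {i for i in range(n) if labels[i] is None}
    while unlabeled:
        # KNN-like step (k = 1): label the unlabeled point closest to any labeled point.
        _, u, src = min(
            (math.dist(points[u], points[l]), u, l)
            for u in unlabeled
            for l in range(n)
            if labels[l] is not None
        )
        labels[u] = labels[src]
        unlabeled.remove(u)
    return labels, queried


# Toy data: one dense group and one sparse group (multi-density, unbalanced).
points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (0.05, 0.05),
          (5, 5), (6, 5), (5, 6), (6, 6)]
truth = [0, 0, 0, 0, 0, 1, 1, 1, 1]
labels, queried = select_and_propagate(points, 2, truth.__getitem__)
```

In this toy run the oracle is simply a lookup into the known labels, and only one label per group is requested. A real setting would replace the oracle with a human annotator and would likely need a more careful rule for deciding which MST edges to cut.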

Authors: Wen Ping (文平), Leng Mingwei (冷明伟), Chen Xiaoyun (陈晓云)

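Affiliations: School of Information Science and Engineering, Lanzhou University; School of Mathematics and Computer Science, Shangrao Normal University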

Source: Application Research of Computers (《计算机应用研究》), indexed in CSCD and the Peking University Core Journals list, 2012, No. 8, pp. 2841-2844 (4 pages)

Funding: Science and Technology Project of the Jiangxi Provincial Department of Education (GJJ11609)

Keywords: data mining; semi-supervised clustering; active learning; labeled data; data selection; minimum spanning tree; multi-density dataset; unbalanced dataset

Classification: TP301.6 [Automation and Computer Technology: Computer System Architecture]



