A semi-supervised clustering algorithm based on seeds and pair-wise constraints
(Original title: 一种基于Seeds集和成对约束的半监督聚类算法; cited by: 7)

Abstract: Semi-supervised learning, an application-driven family of machine learning methods, has become one of the active topics in artificial intelligence and pattern recognition. Semi-supervised clustering, its main branch, studies how a small amount of supervision information can be brought into the search for an optimal clustering in order to improve clustering performance. Existing semi-supervised clustering algorithms fall into three groups: search-based methods, similarity-based methods, and methods that combine search and similarity. Most of them, however, do not use seeds and pair-wise constraints at the same time. This paper therefore proposes a semi-supervised clustering algorithm based on seeds and pair-wise constraints that makes fuller use of the given supervision information.

Tri-training is a representative algorithm built on the Co-training mechanism. Because it uses three classifiers to label unlabeled samples, the proposed algorithm relies on it to obtain more labeled samples. The algorithm has four steps. First, Tri-training selects and labels some unlabeled samples to enlarge the initial set of labeled samples (the seeds). Second, the pair-wise constraints are used to refine the enlarged seed set and improve its quality. Third, the initial cluster centers are computed from the refined seeds. Finally, K-Means is run, and in each iteration of the search the pair-wise constraints are used to correct the partition.

The proposed algorithm is compared with K-Means, Seeded-K-Means, and COP-K-Means. Experiments on three UCI data sets under the same settings show that it takes fuller advantage of the given supervision information and produces better clustering results. A further experiment on the Haberman data set analyzes how the number of pair-wise constraints and the number of labeled samples affect performance: the more pair-wise constraints or labeled samples are available, the better the algorithm performs.
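The last two steps described in the abstract (initial centers taken from the seeds, then K-Means with pair-wise constraints checked at every assignment) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `violates` and `seeded_constrained_kmeans` are hypothetical, the feasibility check follows the general COP-K-Means idea of skipping clusters that break a constraint, and the fallback to the nearest center when no cluster is feasible is an assumption made here so that the sketch always returns a partition. The Tri-training expansion and the seed refinement step are omitted.

```python
import numpy as np


def _partner(pair, i):
    """Return the other endpoint of a constraint pair containing i, else None."""
    a, b = pair
    if a == i:
        return b
    if b == i:
        return a
    return None


def violates(i, c, labels, must_link, cannot_link):
    """True if putting sample i in cluster c breaks a constraint against
    a sample that has already been assigned in the current pass."""
    for pair in must_link:
        j = _partner(pair, i)
        if j is not None and labels[j] != -1 and labels[j] != c:
            return True
    for pair in cannot_link:
        j = _partner(pair, i)
        if j is not None and labels[j] == c:
            return True
    return False


def seeded_constrained_kmeans(X, seeds, must_link=(), cannot_link=(), n_iter=50):
    """seeds maps sample index -> cluster id (ids 0..k-1, each with >= 1 seed)."""
    X = np.asarray(X, dtype=float)
    k = max(seeds.values()) + 1

    # Initial centers are the means of the seeds of each cluster.
    centers = np.empty((k, X.shape[1]))
    for j in range(k):
        members = [i for i, c in seeds.items() if c == j]
        if not members:
            raise ValueError(f"cluster {j} has no seed")
        centers[j] = X[members].mean(axis=0)

    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        new_labels = np.full(len(X), -1)
        for i in range(len(X)):
            # Clusters ordered from nearest to farthest center.
            order = np.argsort(np.linalg.norm(centers - X[i], axis=1))
            feasible = [c for c in order
                        if not violates(i, c, new_labels, must_link, cannot_link)]
            # Assumption of this sketch: if every cluster breaks a constraint,
            # fall back to the nearest center instead of failing.
            new_labels[i] = feasible[0] if feasible else order[0]
        if np.array_equal(new_labels, labels):
            break  # partition is stable
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers


X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = seeded_constrained_kmeans(X, seeds={0: 0, 3: 1})
print(labels.tolist())  # [0, 0, 0, 1, 1, 1]
```

In this sketch the seeds only initialize the centers and may be reassigned later, as in Seeded-K-Means. Passing `cannot_link=[(0, 1)]` or `must_link=[(2, 3)]` forces the corresponding samples apart or together even when plain distance would decide otherwise.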

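Authors: Chang Yu (常瑜), Liang Jiye (梁吉业), Gao Jiawei (高嘉伟), Yang Jing (杨静)

Affiliation: School of Computer and Information Technology, Shanxi University; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education

Source: Journal of Nanjing University (Natural Science) (《南京大学学报（自然科学版）》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2012, No. 4, pp. 405-411 (7 pages)

Funding: National Natural Science Foundation of China (71031006, 70971080); National "973" Program preliminary research special project (2011CB311805); Specialized Research Fund for the Doctoral Program of Higher Education (20101401110002)

Keywords: semi-supervised clustering; seeds; pair-wise constraints

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]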
