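Abstract
Semi-supervised clustering studies how to use a small amount of supervision information to improve clustering performance, and it has become a research hotspot in machine learning. Most existing semi-supervised clustering methods do not jointly consider the two kinds of supervision information, the Seeds set and pair-wise constraints, so a semi-supervised clustering algorithm based on the Seeds set and pair-wise constraints is proposed. The algorithm uses the Tri-training algorithm to expand the Seeds set, then combines pair-wise constraints to optimize the Seeds set and guide the clustering process. Experimental results show that the algorithm effectively improves clustering performance.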
Abstract: Semi-supervised learning, an application-driven branch of machine learning, has become a hot topic in artificial intelligence and pattern recognition. As a main branch of semi-supervised learning, semi-supervised clustering introduces a small amount of supervision information into the search for an optimal clustering. Various semi-supervised clustering algorithms have been proposed in recent years, including search-based methods, similarity-based methods, and methods that combine search and similarity. However, most current semi-supervised clustering algorithms do not use seeds and pair-wise constraints at the same time. A semi-supervised clustering algorithm based on seeds and pair-wise constraints is therefore introduced to make full use of the given supervision information. Tri-training is a representative method based on the Co-training mechanism; because it can use three classifiers to label unlabeled samples, the proposed algorithm uses it to obtain more labeled samples. First, some unlabeled samples are selected and labeled by the Tri-training method to enlarge the initial set of labeled samples. Second, pair-wise constraints are used to optimize the enlarged set of labeled samples and improve its quality. Third, the initial cluster centers are obtained from the optimized labeled samples. Finally, the K-Means algorithm is run, and during the search pair-wise constraints are used to modify the partitioning result each time. The proposed algorithm is compared with the K-Means, Seeded-K-Means and COP-K-Means algorithms. Experimental results on three UCI data sets under the same settings demonstrate that the method takes full advantage of the given supervision information and obtains a better clustering result.
Moreover, an experiment on the Haberman data set analyzes how the number of pair-wise constraints and the number of labeled samples affect the algorithm's performance. The results show that the more pair-wise constraints or the more labeled samples are available, the better the algorithm performs.
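The last two steps of the abstract (initial centers from labeled seeds, then K-Means guided by pair-wise constraints) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `seeded_centers`, `violates` and `constrained_kmeans` are hypothetical, the Tri-training expansion and the seed-optimization step are omitted, and the constraint handling follows a COP-K-Means-style rule (assign each point to the nearest center that breaks no constraint), which is an assumption, since the abstract does not say how the partition is modified. A point for which every cluster breaks a constraint is left unassigned (`-1`) here.

```python
import numpy as np

def seeded_centers(X, seed_idx, seed_labels, k):
    """Initial centers: the mean of the labeled seed points of each cluster."""
    X = np.asarray(X, dtype=float)
    seed_idx = np.asarray(seed_idx)
    seed_labels = np.asarray(seed_labels)
    return np.array([X[seed_idx[seed_labels == c]].mean(axis=0) for c in range(k)])

def violates(i, c, assign, must_link, cannot_link):
    """True if putting point i into cluster c breaks a constraint
    with a point that has already been assigned in this pass."""
    for a, b in must_link:
        if i in (a, b):
            j = b if i == a else a
            if assign[j] != -1 and assign[j] != c:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            j = b if i == a else a
            if assign[j] == c:
                return True
    return False

def constrained_kmeans(X, centers, must_link=(), cannot_link=(), n_iter=20):
    """K-Means whose assignment step respects pair-wise constraints
    (COP-K-Means-style rule; an assumption, not the paper's exact procedure)."""
    X = np.asarray(X, dtype=float)
    centers = np.array(centers, dtype=float)
    assign = np.full(len(X), -1)
    for _ in range(n_iter):
        new_assign = np.full(len(X), -1)
        for i, x in enumerate(X):
            # Try clusters from nearest to farthest center.
            for c in np.argsort(((centers - x) ** 2).sum(axis=1)):
                if not violates(i, c, new_assign, must_link, cannot_link):
                    new_assign[i] = c
                    break
        # Update each center from its current members.
        for c in range(len(centers)):
            members = X[new_assign == c]
            if len(members):
                centers[c] = members.mean(axis=0)
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return assign, centers

# Toy data: two seeds (points 0 and 2) and one ambiguous point (index 4).
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [2.4, 2.4]])
init = seeded_centers(X, seed_idx=[0, 2], seed_labels=[0, 1], k=2)
free, _ = constrained_kmeans(X, init)
guided, _ = constrained_kmeans(X, init, cannot_link=[(0, 4)])
```

In this toy example the ambiguous point joins cluster 0 when no constraints are given, while a cannot-link constraint between points 0 and 4 moves it to cluster 1.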
Source
《南京大学学报(自然科学版)》
CAS
CSCD
Peking University Core Journal
2012, Issue 4, pp. 405-411 (7 pages)
Journal of Nanjing University (Natural Science)
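Funding
National Natural Science Foundation of China (71031006, 70971080)
National "973" Program preliminary research special project (2011CB311805)
Specialized Research Fund for the Doctoral Program of Higher Education (20101401110002)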
Keywords
semi-supervised clustering, seeds, pair-wise constraints