Abstract
Semi-supervised clustering can use a small amount of labeled data to improve clustering performance, but most text clustering algorithms cannot directly apply prior information such as pairwise constraints. Targeting the high-dimensional, sparse nature of text data, this paper proposes a semi-supervised text clustering algorithm. First, the pairwise constraints are expanded and embedded into the document similarity matrix. Then, based on statistics between partitioned and unpartitioned documents, the algorithm gradually searches the remaining unpartitioned texts for K initial cluster-center sets that are dense and have low similarity to the cluster-center sets already selected. The remaining documents, which are relatively hard to distinguish, are then assigned to the K initial cluster-center sets subject to the pairwise constraints. Finally, the clustering result is further optimized with a convergence criterion function that incorporates a penalty for violated pairwise constraints. Because the algorithm determines the initial cluster-center sets automatically during clustering, it avoids the sensitivity of K-means to the choice of initial centers. Experimental results on several Chinese and English datasets show that the proposed algorithm can effectively use a small amount of pairwise-constraint prior information to improve clustering quality.
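The abstract does not give the formulas used for expanding and embedding the constraints or for the violation penalty, so the following is only a minimal sketch under stated assumptions. It assumes that must-link pairs are expanded by transitive closure, that cannot-link pairs are propagated across the resulting must-link groups, that constrained entries of a cosine similarity matrix are overwritten with 1.0 (must-link) or 0.0 (cannot-link), and that the penalty is a fixed weight per violated constraint. The function names, the overwrite values, and the weight `w` are hypothetical and are not taken from the paper.

```python
import numpy as np

def expand_constraints(n, must_link, cannot_link):
    """Expand pairwise constraints over n documents (assumed scheme).

    Must-link pairs are closed transitively with union-find; each
    cannot-link pair is propagated to every pair of documents drawn
    from the two must-link groups it connects.
    """
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    ml = {(i, j) for g in groups.values() for i in g for j in g if i < j}
    cl = set()
    for a, b in cannot_link:
        ra, rb = find(a), find(b)
        if ra == rb:
            raise ValueError("cannot-link pair contradicts must-link closure")
        for i in groups[ra]:
            for j in groups[rb]:
                cl.add((min(i, j), max(i, j)))
    return ml, cl

def embed_constraints(X, must_link, cannot_link):
    """Cosine similarity matrix with the expanded constraints written in."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero documents as zero vectors
    Xn = X / norms
    S = Xn @ Xn.T
    ml, cl = expand_constraints(X.shape[0], must_link, cannot_link)
    for i, j in ml:
        S[i, j] = S[j, i] = 1.0  # assumed value for must-link pairs
    for i, j in cl:
        S[i, j] = S[j, i] = 0.0  # assumed value for cannot-link pairs
    return S, ml, cl

def count_violations(labels, ml, cl):
    """Number of expanded constraints broken by a cluster assignment."""
    broken_ml = sum(1 for i, j in ml if labels[i] != labels[j])
    broken_cl = sum(1 for i, j in cl if labels[i] == labels[j])
    return broken_ml + broken_cl

def penalized_criterion(S, labels, ml, cl, w=1.0):
    """Within-cluster similarity minus w per violated constraint (assumed form)."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    return float(S[same].sum()) - w * count_violations(labels, ml, cl)
```

Under these assumptions, an assignment that respects every expanded constraint pays no penalty, while each violation lowers the criterion by `w`; a larger `w` makes the optimization step treat the constraints as closer to hard.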
Source
Computer Science (《计算机科学》)
CSCD
PKU Core Journal (北大核心)
2016, No. 12, pp. 183-188 (6 pages)
Keywords
Clustering
Semi-supervised
Vector space model (VSM)
Pairwise constraints
Text