
Pairwise Constrained Semi-supervised Text Clustering Algorithm (cited by: 5)
Abstract: Semi-supervised clustering can use a small amount of labeled data to improve clustering performance, but most text clustering algorithms cannot directly exploit prior information such as pairwise constraints. To address the high-dimensional, sparse nature of text data, this paper proposes a semi-supervised text clustering algorithm. The pairwise constraints are first expanded and embedded in the document similarity matrix. Based on that matrix and on statistics between the partitioned and unpartitioned documents, the algorithm then progressively finds K initial cluster-center sets: dense regions of the remaining unpartitioned documents that have low similarity to the center sets already chosen. The remaining documents, which are relatively hard to distinguish, are then assigned to the K initial center sets subject to the pairwise constraints. Finally, the clustering result is further optimized with a convergence criterion function that incorporates a penalty for violated pairwise constraints. Because the initial center sets are determined automatically during clustering, the algorithm avoids the sensitivity of K-means to the choice of initial centers. Experiments on several Chinese and English datasets show that the proposed algorithm can effectively use a small number of pairwise constraints to improve clustering quality.
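The first step of the abstract, expanding the pairwise constraints and embedding them in the similarity matrix, can be illustrated with a minimal sketch. The abstract does not give the paper's exact expansion or embedding rule, so everything here is an assumption: the function names `expand_constraints` and `embed_constraints` are hypothetical, expansion is taken to mean the usual transitive closure of must-link pairs plus propagation of cannot-link pairs across the resulting groups, and embedding is taken to mean overwriting the similarity with 1.0 for must-link pairs and 0.0 for cannot-link pairs.

```python
from itertools import combinations


def expand_constraints(n, must_link, cannot_link):
    """Expand pairwise constraints over documents 0..n-1 (assumed scheme).

    Must-link pairs are closed transitively; each cannot-link pair is
    propagated to every pair of documents drawn from the two must-link
    groups it separates. Pairs are returned as (i, j) with i < j.
    """
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in must_link:
        parent[find(i)] = find(j)

    # Collect the members of each must-link group in ascending order.
    groups = {}
    for doc in range(n):
        groups.setdefault(find(doc), []).append(doc)

    ml = set()
    for members in groups.values():
        ml.update(combinations(members, 2))

    cl = set()
    for i, j in cannot_link:
        ri, rj = find(i), find(j)
        if ri == rj:
            raise ValueError(f"constraints on documents {i} and {j} conflict")
        for a in groups[ri]:
            for b in groups[rj]:
                cl.add((min(a, b), max(a, b)))
    return ml, cl


def embed_constraints(sim, ml, cl):
    """Return a copy of the similarity matrix with constraints written in.

    Assumption: must-link pairs get similarity 1.0, cannot-link pairs 0.0.
    """
    out = [row[:] for row in sim]
    for i, j in ml:
        out[i][j] = out[j][i] = 1.0
    for i, j in cl:
        out[i][j] = out[j][i] = 0.0
    return out
```

For example, with five documents, must-link pairs (0, 1) and (1, 2), and a cannot-link pair (2, 3), this expansion also links documents 0 and 2 and separates document 3 from documents 0 and 1. Document 4 is unconstrained, so its similarities are left unchanged.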
Authors: Wang Zonghu (王纵虎), Liu Su (刘速)
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2016, No. 12, pp. 183-188 (6 pages)
Keywords: Clustering, Semi-supervised, Vector space model (VSM), Pairwise constraints, Text
