Semi-supervised Adaptive Multi-density Document Clustering Algorithm Based on Suffix Tree (基于后缀树的半监督自适应多密度文本聚类算法)

Cited by: 3
Abstract: Semi-supervised document clustering is an active research topic within document clustering and is widely used in data mining and machine learning. Existing partition-based and density-based semi-supervised document clustering algorithms cannot handle multi-density, imbalanced document datasets. In addition, document models based on the vector space represent document features with word or character vectors and do not take the correlation between phrases into account. To address these problems, the paper proposes a semi-supervised adaptive multi-density document clustering algorithm based on the suffix tree document model. The algorithm uses the suffix tree document model to measure the similarity between documents, propagates and expands cluster labels using the k-nearest-neighbor idea, and continually updates the expansion threshold during propagation so that it adapts to multi-density, imbalanced document datasets. Experiments show that the algorithm produces relatively high-quality clustering results and adapts to multi-density datasets.
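The abstract gives only the outline of the method, so the following Python sketch is an illustration under stated assumptions, not the paper's algorithm. It assumes that (1) shared word n-grams stand in for the phrases a suffix tree would index, (2) similarity is the Jaccard overlap of those phrase sets, and (3) each cluster's expansion threshold is a fixed fraction (`decay`) of the mean similarity of the links it has accepted so far. The names `phrases`, `similarity`, `propagate`, `max_len`, `k`, and `decay` are all hypothetical.

```python
from collections import defaultdict, deque


def phrases(doc, max_len=2):
    """Set of word n-grams (length 1..max_len) in a document.

    Assumption: a plain set replaces the suffix tree, which would index
    the same shared phrases more efficiently.
    """
    words = doc.lower().split()
    return {tuple(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)}


def similarity(a, b):
    """Jaccard overlap of two phrase sets (an assumed similarity measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def propagate(docs, seeds, k=2, decay=0.7):
    """Spread seed labels to k nearest neighbours with per-cluster thresholds.

    seeds maps document index -> label. Returns index -> label for every
    document that was reached.
    """
    feats = [phrases(d) for d in docs]
    n = len(docs)
    sim = [[similarity(feats[i], feats[j]) for j in range(n)] for i in range(n)]
    labels = dict(seeds)
    accepted = defaultdict(list)  # label -> similarities of accepted links
    queue = deque(seeds)          # labelled documents still to expand
    while queue:
        i = queue.popleft()
        label = labels[i]
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: sim[i][j], reverse=True)[:k]
        for j in neighbours:
            if j in labels:
                continue
            links = accepted[label]
            # Threshold follows this cluster's own density: a fraction of its
            # mean accepted similarity, or of the best neighbour similarity
            # before any link has been accepted.
            base = sum(links) / len(links) if links else sim[i][neighbours[0]]
            if sim[i][j] > 0 and sim[i][j] >= decay * base:
                labels[j] = label
                links.append(sim[i][j])
                queue.append(j)
    return labels


docs = [
    "the team won the football match",
    "the team won the football game",
    "the team lost the football match",
    "stock prices rose sharply today",
    "stock prices fell today",
    "bond prices fell sharply",
]
result = propagate(docs, {0: "sports", 3: "finance"})
print(result)
```

In this toy input the first three documents overlap heavily while the last three overlap only loosely. Because each label keeps its own threshold, the sparser group can still grow from its seed, which a single threshold tuned to the denser group would prevent.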
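Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), CSCD, Peking University Core Journal, 2016, No. 1, pp. 100-103 (4 pages)
Funding: Supported by the Science and Technology Development Fund of the China Academy of Engineering Physics (Grant No. 2012A0403021)
Keywords: suffix tree; semi-supervised; multi-density; document clustering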