Abstract
Semi-supervised document clustering is an active research topic in document clustering and is widely used in data mining and machine learning. Existing semi-supervised document clustering algorithms based on partitioning and density cannot adapt to multi-density, imbalanced document datasets. In addition, document models based on the vector space represent document features with word or character vectors and do not take the correlation between phrases into account. To address these problems, we propose a semi-supervised adaptive multi-density document clustering algorithm based on the suffix tree document model. The algorithm uses the suffix tree document model to measure the similarity between documents, applies the k-nearest-neighbor idea to propagate and expand cluster labels, and continually updates the expansion threshold during propagation so as to adapt to multi-density, imbalanced document datasets. Experiments show that the algorithm produces high-quality clustering results and can adapt to multi-density datasets.
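The abstract names the ingredients of the method (phrase-based similarity, k-nearest-neighbor label propagation, and an expansion threshold updated during propagation) but gives no formulas. The following Python sketch only illustrates how such pieces might fit together and is not the authors' algorithm. Everything specific in it is an assumption: word n-grams stand in for the phrases a suffix tree would share between documents, Jaccard overlap stands in for the similarity measure, and the threshold rule (a hypothetical factor `alpha` times the mean similarity of the links already accepted into a cluster) is one possible way to let each cluster follow its own density.

```python
from collections import deque


def phrases(text, max_len=3):
    """Word n-grams up to max_len; a crude stand-in for suffix-tree phrases."""
    words = text.lower().split()
    return {
        tuple(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }


def similarity(a, b):
    """Jaccard overlap of two phrase sets (assumed measure, not the paper's)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def propagate(docs, seeds, k=2, alpha=0.5):
    """Expand seed labels to the k nearest unlabeled neighbours.

    docs  : list of strings
    seeds : dict {document index: cluster label}
    alpha : hypothetical factor; a cluster's threshold is alpha times the
            mean similarity of the links accepted into it so far.
    """
    feats = [phrases(d) for d in docs]
    labels = dict(seeds)
    link_sims = {label: [] for label in set(seeds.values())}
    queue = deque(seeds)

    while queue:
        i = queue.popleft()
        label = labels[i]
        # The k most similar documents that are still unlabeled.
        cands = sorted(
            ((similarity(feats[i], feats[j]), j)
             for j in range(len(docs)) if j not in labels),
            reverse=True,
        )[:k]
        for sim, j in cands:
            accepted = link_sims[label]
            # With no links yet, any positive similarity is accepted; after
            # that the threshold tracks the cluster's own observed density.
            threshold = alpha * sum(accepted) / len(accepted) if accepted else 0.0
            if sim > threshold:
                labels[j] = label
                accepted.append(sim)
                queue.append(j)
    return labels
```

Calling `propagate(docs, {0: "finance", 2: "sports"})` on a small list of documents returns a dictionary from document index to label. Documents that never clear the threshold of any cluster stay out of the dictionary, which is a design choice of this sketch rather than something the abstract specifies.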
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
PKU Core Journal (北大核心)
2016, No. 1, pp. 100-103 (4 pages)
Funding
Supported by the Science and Technology Development Foundation of the China Academy of Engineering Physics (Grant No. 2012A0403021)
Keywords
suffix tree
semi-supervised
multi-density
document clustering