Abstract
When texts are clustered or classified, high-dimensional and sparse data often leads to low similarity values. To address this problem, this paper proposes a clustering method based on an improved text similarity calculation. First, the texts are represented with the vector space model (VSM), and the cosine function is used to calculate the similarity between texts. Then, following the principle of similarity propagation between nodes in a network, a threshold is set to find, for each text, the set of texts that are most similar to it; the Jaccard coefficient is then used to turn the similarity between two texts into the similarity between their two text sets. Finally, the spectral clustering algorithm is applied to the resulting text similarity matrix to cluster the texts. Experimental results on the WebKB dataset show that, compared with the traditional K-means and spectral clustering methods, the proposed method improves clustering accuracy, recall, and F-value.
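The similarity steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the raw term-frequency weighting, the whitespace tokenizer, the threshold value of 0.3, the choice to include each text in its own neighbor set, and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def tf_vectors(docs):
    """Represent each text as a term-frequency vector (a simple VSM; weighting is assumed)."""
    return [Counter(doc.lower().split()) for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def neighbor_sets(vectors, threshold):
    """For each text, collect the indices of texts whose cosine similarity
    reaches the threshold. Each text is placed in its own set (an assumption)."""
    n = len(vectors)
    return [
        {j for j in range(n)
         if i == j or cosine(vectors[i], vectors[j]) >= threshold}
        for i in range(n)
    ]


def jaccard(s, t):
    """Jaccard coefficient of two sets: |s & t| / |s | t|."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def improved_similarity(docs, threshold=0.3):
    """Build a text similarity matrix from the Jaccard overlap of neighbor sets."""
    sets = neighbor_sets(tf_vectors(docs), threshold)
    n = len(docs)
    return [[jaccard(sets[i], sets[j]) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    docs = [
        "machine learning course syllabus",
        "course syllabus for machine learning",
        "faculty research in databases",
        "database research group faculty",
    ]
    for row in improved_similarity(docs):
        print(row)
```

The resulting matrix could then be passed as a precomputed affinity to a spectral clustering implementation, for example scikit-learn's `SpectralClustering(affinity="precomputed")`. In this toy input, the last two texts have a cosine similarity of only 0.5, yet their neighbor sets coincide, so the set-based similarity is higher than the direct one, which is the effect the abstract describes.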
Authors
李征
李斌
LI Zheng; LI Bin (School of Computer and Information Engineering, Henan University, Kaifeng, Henan 475004, China; Key Laboratory of Intelligent Vision Monitoring for Hydropower Project of Hubei Province, China Three Gorges University, Yichang, Hubei 443002, China)
Source
Journal of Henan University (Natural Science) (《河南大学学报(自然科学版)》)
CAS
2018, Issue 4, pp. 415-420 (6 pages)
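Funding
National Basic Research Program of China (973 Program) (2014CB340404)
National Natural Science Foundation of China (61402150, 61402151)
China Postdoctoral Science Foundation (2016M592286)
Henan University Research Fund (2013YBZR015)
Open Fund of the Key Laboratory of Intelligent Vision Monitoring for Hydropower Project of Hubei Province, China Three Gorges University (2016KLA04)
Henan Province Science and Technology Research and Development Program (182102410063)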