
A Text Clustering Method Based on Improved Similarity Calculation (cited by: 3)
Abstract To address the low similarity values caused by high-dimensional, sparse data when clustering or classifying texts, this paper proposes a clustering method based on an improved text similarity calculation. First, texts are represented with the vector space model (VSM), and the cosine function is used to compute the similarity between texts. Then, following the principle of similarity propagation between nodes in a network, a threshold is set to find, for each text, the set of texts with high similarity to it; the Jaccard coefficient is then used to turn the similarity between two texts into the similarity between their two text sets. Finally, the spectral clustering algorithm is applied to the resulting text similarity matrix to cluster the texts. Experimental results on the WebKB dataset show that, compared with traditional K-means and spectral clustering, the proposed method improves clustering accuracy, recall, and F-value.
Authors LI Zheng; LI Bin (School of Computer and Information Engineering, Henan University, Kaifeng, Henan 475004, China; Key Laboratory of Intelligent Vision Monitoring for Hydropower Project of Hubei Province, China Three Gorges University, Yichang, Hubei 443002, China)
Source Journal of Henan University (Natural Science), CAS, 2018, No. 4, pp. 415-420 (6 pages)
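Funding National Basic Research Program of China (973 Program) (2014CB340404); National Natural Science Foundation of China (61402150, 61402151); China Postdoctoral Science Foundation (2016M592286); Henan University Research Fund (2013YBZR015); Open Fund of the Key Laboratory of Intelligent Vision Monitoring for Hydropower Project of Hubei Province, China Three Gorges University (2016KLA04); Henan Province Science and Technology Research and Development Special Project (182102410063)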
Keywords text similarity; Jaccard coefficient; text set; spectral clustering algorithm
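
The similarity step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the raw term-frequency weighting, the threshold value of 0.5, the choice to include each text in its own neighbor set, and all function names (`cosine`, `neighbor_sets`, `jaccard`, `improved_similarity`) are assumptions made for illustration. The sketch stops at the similarity matrix; the final spectral clustering step is left out.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def neighbor_sets(vectors, threshold):
    """For each text, collect the indices of texts whose cosine similarity
    to it reaches the threshold. Assumption: a text belongs to its own set."""
    n = len(vectors)
    return [
        {i} | {j for j in range(n) if cosine(vectors[i], vectors[j]) >= threshold}
        for i in range(n)
    ]


def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def improved_similarity(docs, threshold=0.5):
    """Set-based similarity matrix: entry (i, j) is the Jaccard coefficient
    of the neighbor sets of texts i and j."""
    # Assumed VSM weighting: raw term counts over whitespace tokens.
    vectors = [Counter(d.lower().split()) for d in docs]
    sets = neighbor_sets(vectors, threshold)
    n = len(docs)
    return [[jaccard(sets[i], sets[j]) for j in range(n)] for i in range(n)]


# Toy corpus (hypothetical, not taken from WebKB).
docs = [
    "course homework lecture exam",
    "course lecture",
    "homework exam",
    "faculty research publications",
    "research publications lab",
]
S = improved_similarity(docs, threshold=0.5)
vectors = [Counter(d.split()) for d in docs]
direct = cosine(vectors[1], vectors[2])
```

In this toy corpus, texts 1 and 2 share no terms, so `direct` is 0, yet both are close to text 0, so their neighbor sets overlap and `S[1][2]` is 1/3. The matrix `S` could then be handed to any spectral clustering routine that accepts a precomputed affinity matrix.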
