A semi-supervised short text clustering algorithm based on improved similarity and class-center vector (cited by: 1)
Abstract: To address the shortcomings of existing short text clustering algorithms, a semi-supervised short text clustering algorithm based on improved similarity and class-center vectors is proposed. First, strong category differentiation words are defined, and a set of such words is extracted and constructed from the class information of the labeled data. The cosine similarity based on the initial features is then fused with the similarity based on strong category differentiation words, giving an improved short text similarity measure. Next, each unclassified sample is assigned to a class by computing its similarity to the class-center vectors; at the same time, the labeled data set and the class-center vectors are updated, and the strong category differentiation words are extracted again. This process is repeated until every sample has been assigned to a class. Experiments show that, compared with other algorithms of the same kind, the proposed algorithm achieves higher clustering accuracy and better time efficiency.
Authors: LI Xiao-hong; RAN Hong-yan; GONG Ji-heng; YAN Li; MA Hui-fang (College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China)
Source: Computer Engineering & Science (CSCD; Peking University Core Journal), 2018, Issue 9, pp. 1710-1716 (7 pages)
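Funding: National Natural Science Foundation of China (61163039); Gansu Provincial Youth Science and Technology Fund (1606RJYA269, 145RJYA259); Scientific Research Project of Gansu Higher Education Institutions (2015A-008); Northwest Normal University Young Teachers' Research Capability Improvement Program (NWNU-LKQN-14-5, NWNU-LKQN-16-20)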
Keywords: strong category differentiation; similarity; class-center vector; semi-supervised clustering; short text
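
The loop described in the abstract (extract strong category differentiation words, fuse two similarities, assign samples to the nearest class center, update, repeat) can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything specific here is an assumption made for illustration, not the authors' method: the `threshold` rule for deciding which words count as strong, the linear weight `alpha` used to fuse the two similarities, the choice to assign one most-confident sample per round, and all function names.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def class_centers(labeled):
    """Sum the term counts of each class's labeled texts into a center vector."""
    centers = defaultdict(Counter)
    for tokens, label in labeled:
        centers[label].update(tokens)
    return centers


def strong_words(labeled, threshold=0.8):
    """ASSUMED criterion: a word is 'strong' for a class when at least
    `threshold` of its occurrences in the labeled data fall in that class."""
    per_class = defaultdict(Counter)
    total = Counter()
    for tokens, label in labeled:
        per_class[label].update(tokens)
        total.update(tokens)
    return {
        label: {t for t, c in counts.items() if c / total[t] >= threshold}
        for label, counts in per_class.items()
    }


def fused_similarity(tokens, label, centers, strong, alpha=0.5):
    """ASSUMED fusion: weighted sum of cosine similarity to the class center
    and the share of the text's distinct words that are strong for the class."""
    base = cosine(Counter(tokens), centers[label])
    uniq = set(tokens)
    overlap = len(uniq & strong[label]) / len(uniq) if uniq else 0.0
    return alpha * base + (1 - alpha) * overlap


def cluster(labeled, unlabeled, alpha=0.5, threshold=0.8):
    """Repeatedly label the most confident pending sample, then rebuild the
    class centers and strong-word sets from the enlarged labeled set."""
    labeled = list(labeled)
    pending = list(unlabeled)
    while pending:
        centers = class_centers(labeled)
        strong = strong_words(labeled, threshold)
        candidates = (
            (fused_similarity(doc, c, centers, strong, alpha), i, c)
            for i, doc in enumerate(pending)
            for c in centers
        )
        _, i, label = max(candidates, key=lambda x: x[0])
        labeled.append((pending.pop(i), label))
    return labeled
```

Each text is a list of tokens, and `cluster` takes a list of `(tokens, label)` pairs plus a list of unlabeled token lists. Recomputing the centers and strong-word sets after every assignment keeps the sketch close to the update step in the abstract, at the cost of extra passes over the labeled data.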