Abstract
To address the shortcomings of existing short text clustering algorithms, this paper proposes a semi-supervised short text clustering algorithm based on an improved similarity measure and class-center vectors. First, strong category differentiation words are defined, and a set of such words is extracted and constructed from the category information of the labeled data. The cosine similarity computed on the initial features is then fused with a similarity based on strong category differentiation words, yielding an improved short text similarity formula. Next, each unclassified sample is assigned to a class according to its similarity to the class-center vectors; at the same time, the labeled data set and the class-center vectors are updated and the strong category differentiation words are re-extracted. This process repeats until every sample has been assigned to a class. Experiments show that, compared with other algorithms of the same kind, the proposed algorithm achieves higher clustering accuracy and better time efficiency.
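The loop described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so several parts are assumptions made only for illustration: a word counts as "strong" for a class when at least `threshold` of the labeled documents containing it belong to that class; the two similarities are fused by a linear blend with a hypothetical weight `alpha`; and only the single most confident sample is assigned per round. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def strong_words(labeled, threshold=0.8):
    """ASSUMPTION: a word is strong for a class if at least `threshold`
    of the labeled documents containing it carry that class label."""
    doc_freq = defaultdict(Counter)  # word -> per-class document counts
    for tokens, label in labeled:
        for t in set(tokens):
            doc_freq[t][label] += 1
    strong = defaultdict(set)
    for t, counts in doc_freq.items():
        label, n = counts.most_common(1)[0]
        if n / sum(counts.values()) >= threshold:
            strong[label].add(t)
    return strong


def class_centers(labeled):
    """Class-center vector as summed term counts per class
    (cosine is scale-invariant, so the sum behaves like the mean)."""
    centers = defaultdict(Counter)
    for tokens, label in labeled:
        centers[label].update(tokens)
    return centers


def fused_similarity(tokens, label, center, strong, alpha=0.5):
    """ASSUMPTION: linear blend of cosine similarity and the share of
    the document's words that are strong words of the class."""
    words = set(tokens)
    overlap = len(words & strong.get(label, set())) / len(words) if words else 0.0
    return alpha * cosine(Counter(tokens), center) + (1 - alpha) * overlap


def cluster(labeled, unlabeled, alpha=0.5):
    """Assign every unlabeled document, refreshing the class centers and
    strong words after each assignment. Requires a non-empty `labeled`."""
    labeled = list(labeled)
    pending = list(unlabeled)
    while pending:
        centers = class_centers(labeled)
        strong = strong_words(labeled)
        # ASSUMPTION: assign only the most confident (document, class) pair.
        _, i, label = max(
            (fused_similarity(doc, lab, centers[lab], strong, alpha), i, lab)
            for i, doc in enumerate(pending)
            for lab in centers
        )
        labeled.append((pending.pop(i), label))
    return labeled
```

Calling `cluster(seed, docs)` with `seed` as a list of `(tokens, label)` pairs and `docs` as a list of token lists returns the seed pairs followed by the newly labeled documents. Assigning one sample per round keeps the sketch short but recomputes the centers often; a batch of high-confidence samples per round would be an equally plausible reading of the abstract.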
Authors
LI Xiao-hong (李晓红)
RAN Hong-yan (冉宏艳)
GONG Ji-heng (龚继恒)
YAN Li (颜丽)
MA Hui-fang (马慧芳)
College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China
Source
Computer Engineering & Science (《计算机工程与科学》)
CSCD
Peking University Core Journals (北大核心)
2018, Issue 9, pp. 1710-1716 (7 pages)
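Funding
National Natural Science Foundation of China (61163039)
Gansu Province Youth Science and Technology Fund (1606RJYA269, 145RJYA259)
Scientific Research Project of Higher Education Institutions of Gansu Province (2015A-008)
Northwest Normal University Young Teachers' Research Capacity Improvement Program (NWNU-LKQN-14-5, NWNU-LKQN-16-20)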
Keywords
strong category differentiation
similarity
class-center vector
semi-supervised clustering
short text