
Short Text Classification with Weighted Similarity Based on Category Topic Word Set (cited by: 1)
Abstract: Short texts have sparse features, so conventional classifiers perform poorly on them. This paper uses a word vector model to propose a word-level short text classification algorithm based on weighted similarity to category topic word sets. A word vector model is trained first. TF-IDF is then used to select the topic words that best represent each category, forming a category topic word set. The similarity between the keywords of a short text and the topic words of each category is computed, with each topic word's contribution to its category expressed as a weight, and the category with the highest similarity is assigned to the short text. Experiments show that, compared with the KNN, logistic regression, and decision tree classifiers, the method improves precision by 2.9%, 1.8%, and 10.2% and recall by 3.0%, 1.7%, and 10.4%, respectively. One limitation is that the contribution of a topic word to its category is quantified along only a simple dimension. Overall, the algorithm addresses the problem of insufficient features in short text classification at the word level and improves classification performance.
Authors: WANG Xiao-nan; HUANG Wei-dong (School of Management, Nanjing University of Posts and Telecommunications, Nanjing 210003, China)
Source: Computer Technology and Development, 2022, No. 9, pp. 95-99 (5 pages)
Funding: National Natural Science Foundation of China (7217011293); Major Program of the National Social Science Fund of China (16ZDA054); Postgraduate Research and Practice Innovation Program of Jiangsu Province (KYCX21_0836).
Keywords: Word2Vec; short text classification; similarity; category topic; weighting
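
The pipeline outlined in the abstract (word vectors, TF-IDF topic word selection, weighted similarity, highest score wins) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact weighting or aggregation formula, so the normalised TF-IDF weights, the mean over keywords, the smoothed IDF, and the function names `topic_word_sets` and `classify` are all assumptions made for illustration. The hand-written two-dimensional vectors in `VECTORS` stand in for a trained Word2Vec model, and the toy corpus in `DOCS` is invented.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topic_word_sets(docs_by_category, top_k=2):
    """Select the top_k TF-IDF words per category (each category = one document).

    Returns {category: {word: weight}}; weights are the selected TF-IDF
    scores normalised to sum to 1 (an assumed weighting scheme).
    """
    counts = {c: Counter(w for doc in docs for w in doc)
              for c, docs in docs_by_category.items()}
    n_cat = len(counts)
    # Number of categories in which each word occurs.
    df = Counter(w for cnt in counts.values() for w in cnt)
    result = {}
    for c, cnt in counts.items():
        total = sum(cnt.values())
        # Smoothed IDF keeps scores positive even for words shared by all categories.
        scores = {w: (n / total) * (math.log((1 + n_cat) / (1 + df[w])) + 1)
                  for w, n in cnt.items()}
        # Sort by descending score, breaking ties alphabetically for determinism.
        top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        z = sum(s for _, s in top)
        result[c] = {w: s / z for w, s in top}
    return result


def classify(keywords, topics, vectors):
    """Score each category by the mean, over keywords, of the weighted
    cosine similarity to its topic words; return (best_category, scores)."""
    scores = {}
    for c, weighted in topics.items():
        per_kw = [
            sum(wt * cosine(vectors[k], vectors[t])
                for t, wt in weighted.items() if t in vectors)
            for k in keywords if k in vectors
        ]
        scores[c] = sum(per_kw) / len(per_kw) if per_kw else 0.0
    return max(scores, key=scores.get), scores


# Toy stand-in for trained word vectors (hypothetical values).
VECTORS = {
    "goal": [0.9, 0.1], "match": [0.8, 0.2], "team": [0.85, 0.15],
    "striker": [0.95, 0.05], "stock": [0.1, 0.9], "market": [0.2, 0.8],
    "bank": [0.15, 0.85], "shares": [0.05, 0.95],
}
# Toy tokenised training texts per category (hypothetical).
DOCS = {
    "sports": [["goal", "match", "team"], ["goal", "team"]],
    "finance": [["stock", "market", "bank"], ["stock", "market"]],
}

TOPICS = topic_word_sets(DOCS, top_k=2)
label, scores = classify(["striker", "match"], TOPICS, VECTORS)
print(label, scores)
```

Because the score depends on embedding similarity and not on exact word overlap, a keyword such as "striker" that never appears in the training texts can still pull a short text toward the right category, which is how a word-level approach can compensate for sparse features.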