
Chinese short text classification method based on word2vec embedding (cited by: 29)
Abstract In short text classification, the word limit weakens the expressive power of text features, and this is the main factor constraining classification performance. To address this problem, a Chinese short text classification method based on embedding trained by word2vec from Wikipedia (CSTC-EWW) was proposed, and experiments were carried out on a short text set covering 4 topics from Sina iAsk (iask.com). The method first trains word2vec on a Wikipedia corpus to obtain the word embedding model, then builds short text features based on this model, and finally classifies the short texts with classic classifiers such as SVM and Naive Bayes. The experimental results show that CSTC-EWW classifies short texts effectively, with a best-case F-measure of 81.8%. Compared with a short text classification method that represents features with the bag-of-words (BOW) model weighted by term frequency-inverse document frequency (TF-IDF), and with a method that likewise extends features using an external Wikipedia corpus, CSTC-EWW gives better classification results; in the best case, on the car topic, its F-measure improves by 45.2%.
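The pipeline described in the abstract (embedding model, then short text features, then a classifier) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not say how the short text features are built from the embedding, so the sketch simply averages word vectors; the hand-written 3-dimensional table `TOY_EMBEDDING` stands in for a word2vec model trained on Wikipedia; a nearest-centroid rule stands in for the SVM and Naive Bayes classifiers the paper uses; the `health` topic is hypothetical; and the input is assumed to be already segmented into words.

```python
from math import sqrt

# Toy 3-dimensional table standing in for a word2vec model trained on
# Wikipedia. Words and values are made up for illustration only.
TOY_EMBEDDING = {
    "engine": [0.9, 0.1, 0.0],
    "tire":   [0.8, 0.2, 0.1],
    "brake":  [0.7, 0.1, 0.2],
    "fever":  [0.1, 0.9, 0.1],
    "cough":  [0.0, 0.8, 0.2],
    "doctor": [0.2, 0.7, 0.1],
}
DIM = 3


def text_vector(tokens, embedding=TOY_EMBEDDING, dim=DIM):
    """Average the vectors of in-vocabulary tokens (an assumed feature scheme)."""
    vectors = [embedding[t] for t in tokens if t in embedding]
    if not vectors:
        return [0.0] * dim  # no known word: fall back to the zero vector
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_centroids(samples):
    """samples: list of (tokens, label). Returns label -> mean text vector."""
    grouped = {}
    for tokens, label in samples:
        grouped.setdefault(label, []).append(text_vector(tokens))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(tokens, centroids):
    """Nearest-centroid stand-in for the paper's SVM / Naive Bayes step."""
    vec = text_vector(tokens)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Tiny labelled set of pre-segmented short texts (hypothetical data).
train = [
    (["engine", "brake"], "car"),
    (["tire", "engine"], "car"),
    (["fever", "doctor"], "health"),
    (["cough", "fever"], "health"),
]
centroids = fit_centroids(train)
print(predict(["tire", "brake", "unknownword"], centroids))
print(predict(["cough", "doctor"], centroids))
```

Out-of-vocabulary words are skipped when averaging, so a short text with only a few known words still gets a dense feature vector. In a sparse BOW/TF-IDF representation, by contrast, two short texts that share no words have no overlapping features at all.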
Authors GAO Mingxia (高明霞); LI Jingwei (李经纬) (Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China)
Source Journal of Shandong University (Engineering Science) (《山东大学学报(工学版)》), indexed in CAS, CSCD, and the Peking University Core Journals list, 2019, No. 2, pp. 34-41 (8 pages)
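Funding Beijing Key Laboratory of MRI and Brain Informatics Fund (20160201); State Key Laboratory of Digital Publishing Fund (Q5007013201501); School of Computer Science school-level research project (2018JSJKY008)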
Keywords short texts; Chinese text classification; Wikipedia; word2vec; embedding
