Document Representation Fused with Term Contribution and Word2Vec Word Vector (cited by: 16)
Abstract: Existing document vector representation methods are affected by noise words and capture the semantics of important words incompletely. To address these problems, this paper proposes a new document representation method that fuses Term Contribution (TC) with Word2Vec word vectors. A Word2Vec model is trained on the dataset, and the TC of each word in the dataset is calculated. A contribution threshold is then set, and the words whose TC exceeds the threshold are extracted to build a word set. On this basis, the words that occur in both the document and the word set are identified, and their word vectors are fused with their TC values to generate the document vector. Experimental results show that, on the Sogou Chinese text corpus and the Fudan University Chinese text classification corpus, the average accuracy, recall, and F1 value of the proposed method are better than those of traditional methods such as TF-IDF, mean Word2Vec, and PTF-IDF weighted Word2Vec. The method can also classify English texts effectively.
Authors: PENG Junli (彭俊利); GU Yu (谷雨); ZHANG Zhen (张震); GENG Xiaohang (耿小航). National Defense Key Discipline Laboratory of Communication Information Transmission and Fusion Technology, Hangzhou Dianzi University, Hangzhou 310000, China
Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2021, No. 4, pp. 62-67 (6 pages)
Funding: National Natural Science Foundation of China (61673146).
Keywords: Term Contribution (TC); Word2Vec word vector; word embedding; document representation; text classification
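
The pipeline outlined in the abstract (compute TC, filter by a threshold, fuse TC with word vectors) can be illustrated with a minimal Python sketch. The abstract gives neither the TC formula nor the fusion rule, so both are assumptions here: TC is taken to be the sum, over all pairs of distinct documents, of the product of a term's normalised TF-IDF weights, and fusion is taken to be a TC-weighted average of word vectors. The function names and the `vectors` dictionary are hypothetical; `vectors` stands in for the output of a trained Word2Vec model.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {word: weight} dict per document (L2-normalised TF-IDF).

    Assumption: plain tf * log(N / df) weighting; the paper may use a variant.
    """
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        raw = {w: tf[w] * math.log(n / df[w]) for w in tf}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        weights.append({w: v / norm for w, v in raw.items()})
    return weights


def term_contribution(docs):
    """Assumed TC: sum over document pairs i != j of f(t, d_i) * f(t, d_j).

    Uses the identity sum_{i != j} f_i * f_j = (sum_i f_i)^2 - sum_i f_i^2.
    """
    total, squares = Counter(), Counter()
    for doc_weights in tfidf_weights(docs):
        for w, v in doc_weights.items():
            total[w] += v
            squares[w] += v * v
    return {w: max(0.0, total[w] * total[w] - squares[w]) for w in total}


def document_vector(doc, tc, vectors, threshold, dim):
    """Assumed fusion rule: TC-weighted average of the retained word vectors.

    A word is retained if its TC exceeds the threshold and it has a vector.
    Each occurrence in the document counts once.
    """
    keep = [w for w in doc if tc.get(w, 0.0) > threshold and w in vectors]
    if not keep:
        return [0.0] * dim  # no informative word survived the filter
    weight_sum = sum(tc[w] for w in keep)
    vec = [0.0] * dim
    for w in keep:
        for k in range(dim):
            vec[k] += tc[w] * vectors[w][k]
    return [x / weight_sum for x in vec]


# Toy corpus and hand-made 2-d vectors (hypothetical stand-ins for Word2Vec output).
docs = [["cat", "sat", "the"], ["cat", "ran", "the"], ["dog", "ran", "the"]]
vectors = {"cat": [1.0, 0.0], "ran": [0.0, 1.0], "the": [5.0, 5.0]}
tc = term_contribution(docs)
for d in docs:
    print(d, document_vector(d, tc, vectors, threshold=0.1, dim=2))
```

In this toy corpus, "the" occurs in every document and so receives zero TF-IDF weight and zero TC, which is how the threshold screens out a noise word. In practice the vectors would come from a Word2Vec model trained on the same corpus, and the threshold would be tuned on held-out data.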