Abstract
Short texts suffer from inherently sparse features, and traditional classification models suffer from a "lexical gap". The Word2Vec model, which maps words to low-dimensional real-valued vectors according to their contextual semantic relations, effectively alleviates the feature sparsity of short texts and introduces semantic relations that traditional text classification models lack. However, further study shows that using Word2Vec alone ignores the influence that words of different parts of speech have on a short text. We therefore introduce part of speech to improve the feature weighting method: the contribution of part of speech to text classification is embedded into the traditional TF-IDF algorithm to compute the weight of each word in a short text, and these weights are combined with Word2Vec word vectors to generate the short text vector. Finally, an SVM is used to classify the short texts. Experimental results on the Fudan University Chinese text classification corpus validate the effectiveness of the proposed method.
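The weighting step described in the abstract can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not taken from the paper: the part-of-speech weights in `POS_WEIGHT`, the smoothed IDF form `log(N / df) + 1`, the choice of a weighted average (rather than a weighted sum) to build the text vector, and the helper names `pos_tfidf` and `text_vector`. Word vectors are passed in as a plain dictionary; in practice they would come from a trained Word2Vec model, and the resulting text vectors would be fed to an SVM classifier.

```python
import math
from collections import Counter

# Hypothetical part-of-speech contribution weights (assumed values,
# not the ones used in the paper): nouns > verbs > adjectives > others.
POS_WEIGHT = {"n": 1.0, "v": 0.8, "a": 0.5}
DEFAULT_POS_WEIGHT = 0.2


def pos_tfidf(docs):
    """Compute POS-weighted TF-IDF weights.

    docs: list of documents, each a list of (word, pos_tag) pairs.
    Returns one {word: weight} dict per document, where
    weight = tf * idf * pos_weight and idf = log(N / df) + 1 (assumed form).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter()
    for doc in docs:
        df.update({word for word, _ in doc})

    all_weights = []
    for doc in docs:
        tf = Counter(word for word, _ in doc)
        pos_of = {word: pos for word, pos in doc}
        total = len(doc)
        weights = {}
        for word, count in tf.items():
            idf = math.log(n_docs / df[word]) + 1.0
            pos_w = POS_WEIGHT.get(pos_of[word], DEFAULT_POS_WEIGHT)
            weights[word] = (count / total) * idf * pos_w
        all_weights.append(weights)
    return all_weights


def text_vector(word_weights, embeddings, dim):
    """Build a short-text vector as the weighted average of word vectors.

    Words without an embedding are skipped; an all-zero vector is
    returned if no word in the text has an embedding.
    """
    vec = [0.0] * dim
    weight_sum = 0.0
    for word, weight in word_weights.items():
        emb = embeddings.get(word)
        if emb is None:
            continue
        for i in range(dim):
            vec[i] += weight * emb[i]
        weight_sum += weight
    if weight_sum > 0:
        vec = [x / weight_sum for x in vec]
    return vec
```

With this weighting, a noun and a verb that have the same term frequency and document frequency contribute unequally to the text vector, which is the effect the abstract attributes to introducing part of speech.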
Authors
WANG Jing (汪静); LUO Lang (罗浪); WANG De-Qiang (王德强)
School of Computer Science, South-Central University for Nationalities, Wuhan 430074, China
Source
Computer Systems & Applications (《计算机系统应用》), 2018, No. 5, pp. 209-215 (7 pages)
Funding
CERNET Next Generation Internet Technology Innovation Project (NGII20150106)