Abstract
Short texts are brief and their features are sparse, so they carry little information and are sensitive to noise, which makes short text classification challenging. This paper proposes a text representation method that combines word embeddings, word similarity, and word importance. Word similarity is introduced on top of Word2vec, and the product of similarity and term frequency-inverse document frequency (TF-IDF) is used as the influence factor of each extended word on the short text. A short text vector is then constructed from these weighted words and used for classification. Experimental results show that the proposed method achieves higher classification accuracy than the traditional bag-of-words model, word vectors trained directly with Word2vec, unweighted word vector extension, and extension weighted by TF-IDF alone.
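The weighting idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how extended words are selected or how the weighted vectors are combined, so the toy embeddings, the smoothed IDF, the `top_k` neighbour selection, the weighted sum, and the names `tfidf_weights`, `cosine`, and `short_text_vector` are all assumptions made for illustration. In practice the embeddings would come from a trained Word2vec model.

```python
import math
from collections import Counter

import numpy as np


def tfidf_weights(doc, corpus):
    """Return {word: TF-IDF} for one tokenized document."""
    counts = Counter(doc)
    n_docs = len(corpus)
    weights = {}
    for word, count in counts.items():
        df = sum(1 for other in corpus if word in other)
        # Smoothed IDF: an assumption, the abstract gives no exact formula.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[word] = (count / len(doc)) * idf
    return weights


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def short_text_vector(doc, corpus, embeddings, top_k=1):
    """Sum TF-IDF-weighted word vectors plus similarity-weighted extensions.

    Each word contributes tfidf * vector. Each of its top_k nearest
    neighbours outside the text contributes similarity * tfidf * vector,
    i.e. the product of similarity and TF-IDF is the influence factor.
    """
    weights = tfidf_weights(doc, corpus)
    dim = len(next(iter(embeddings.values())))
    total = np.zeros(dim)
    for word, weight in weights.items():
        if word not in embeddings:
            continue  # out-of-vocabulary words are skipped (assumption)
        vec = embeddings[word]
        total += weight * vec
        candidates = [
            (cosine(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other not in weights
        ]
        for sim, other in sorted(candidates, reverse=True)[:top_k]:
            total += sim * weight * embeddings[other]
    return total


# Toy 2-dimensional embeddings standing in for Word2vec vectors.
embeddings = {
    "cat": np.array([1.0, 0.0]),
    "kitten": np.array([0.9, 0.1]),
    "car": np.array([0.0, 1.0]),
    "truck": np.array([0.1, 0.9]),
}
corpus = [["cat", "car"], ["car", "truck"]]
doc = ["cat"]
vector = short_text_vector(doc, corpus, embeddings, top_k=1)
```

The resulting `vector` could then be passed to any standard classifier. Setting `top_k=0` disables the extension step and leaves a plain TF-IDF-weighted sum of word vectors, which gives a baseline to compare against.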
Authors
LU Junyu (卢俊宇)
ZHOU Xiangxiang (周翔翔)
Unit 95784 of PLA, Leshan 614100, Sichuan, China
Source
Command Information System and Technology (《指挥信息系统与技术》)
2020, Issue 4, pp. 70-73 (4 pages)
Funding
Supported by the Natural Science Basic Research Plan of Shaanxi Province (2017JM6062).
Keywords
Word2vec
term frequency-inverse document frequency (TF-IDF)
similarity
text representation
feature extension