Abstract
Word Mover's Distance (WMD) has become a very popular algorithm in recent years for computing the distance between texts. It measures text similarity fairly accurately and is widely applied in areas such as public opinion analysis and content classification. The most important step in WMD is converting words into a bag-of-words representation built on 300-dimensional word vectors. Because the word weights are assigned randomly when the word vectors are obtained, the accuracy of the similar texts returned is unstable. Building on the WMD algorithm, this paper extracts keywords and uses part-of-speech classification to assign different weights to words of different parts of speech, thereby further optimizing WMD and improving classification accuracy.
Author
Zhao Mingyue (赵明月), School of Computer and Information Engineering, Henan University, Kaifeng, Henan 475004, China
Source
Computer Era (《计算机时代》), 2018, No. 5, pp. 66-70, 73 (6 pages)
Keywords
part-of-speech classification
weight
keyword extraction
similarity