Abstract
To address the short length, variable wording, and fragmentary sentences typical of mobile marketing texts, this paper proposes a text representation method that uses word2vec for term-weighted semantic mapping. First, word vectors are trained with word2vec on the full corpus, and the resulting vectors are clustered so that semantically similar words are grouped into a feature space of semantic clusters. During text vectorization, each feature weight is computed by combining the similarity between a word and the cluster center with the weight of the word itself. The similarity between vectorized texts is then computed with Euclidean distance. Applied to a mobile marketing short-text test set, K-nearest-neighbor (KNN) classification experiments show that the method improves the F1 value of each class by 6% on average over a method based on word statistical features, and that it measures the similarity of mobile marketing short texts more effectively.
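The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word vectors and cluster centers are hypothetical toy values standing in for word2vec training and clustering, and the choices of cosine similarity, nearest-cluster assignment, and the per-word weights (for example, TF-IDF scores) are assumptions the abstract does not specify.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def text_to_cluster_vector(tokens, word_vecs, centers, word_weight):
    """Map a tokenized text onto the semantic-cluster feature space.

    Assumption: each word contributes only to its most similar cluster,
    with contribution = word weight * similarity to that cluster center.
    """
    vec = [0.0] * len(centers)
    for tok in tokens:
        if tok not in word_vecs:
            continue  # out-of-vocabulary words are skipped
        sims = [cosine(word_vecs[tok], c) for c in centers]
        k = max(range(len(centers)), key=sims.__getitem__)
        vec[k] += word_weight.get(tok, 1.0) * sims[k]
    return vec


def euclidean(u, v):
    """Euclidean distance between two text vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(query, labeled_vecs, k=3):
    """Majority vote among the k nearest labeled text vectors."""
    nearest = sorted(labeled_vecs, key=lambda item: euclidean(query, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical toy inputs; a real system would obtain word_vecs from
# word2vec and centers from clustering those vectors.
word_vecs = {
    "discount": [1.0, 0.1],
    "sale": [0.9, 0.2],
    "data": [0.1, 1.0],
    "traffic": [0.2, 0.9],
}
centers = [[1.0, 0.0], [0.0, 1.0]]
word_weight = {"discount": 2.0, "sale": 1.0, "data": 1.5, "traffic": 1.0}

train_texts = [
    (["discount", "sale"], "promo"),
    (["sale"], "promo"),
    (["data", "traffic"], "plan"),
    (["data"], "plan"),
]
train = [
    (text_to_cluster_vector(toks, word_vecs, centers, word_weight), label)
    for toks, label in train_texts
]

query_vec = text_to_cluster_vector(
    ["discount", "unknown"], word_vecs, centers, word_weight
)
predicted = knn_predict(query_vec, train, k=3)
print(predicted)
```

Because every text is projected onto a fixed number of cluster dimensions, two short texts that share no surface words can still end up close together when their words fall into the same semantic clusters, which is the property the abstract relies on.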
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journals (北大核心)
2017, Issue A01, pp. 292-294, 299 (4 pages)
Keywords
mobile marketing
short text vectorization
similarity calculation
word2vec
K-Nearest Neighbor (KNN)