
Word Embedding Algorithm Based on Count-Based Models

Abstract: Word embedding is currently a very popular technique for text processing tasks. Compared with predictive models, word embeddings built from count-based models have the advantages of being simple, fast, and easy to train, and of capturing word similarity well. Based on count-based models, five word embedding models were constructed by selecting two kinds of context, two weight calculation methods, and two similarity calculation methods. The five models were compared and analyzed on a word similarity task, which showed that word representations obtained after dimensionality reduction outperform those without it. Among the five models, the one using window context, PMI weighting, and cosine similarity performed best on the word similarity task. The five models were also compared with the prediction-based Skip-gram model; the results show that, with the training vector dimension set to 100, most of the count-based models match or exceed the performance of Skip-gram on the word similarity task.
Source: Journal of Shenyang Aerospace University (《沈阳航空航天大学学报》), 2017, Issue 2, pp. 66-72 (7 pages)
Funding: National Key Technology R&D Program of China (国家科技支撑计划, Project No. 2015BAH20F01)
Keywords: word representations; count-based models; word embedding; word similarities
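
The abstract outlines the best-performing configuration: window context, PMI weighting, dimensionality reduction, and cosine similarity, with 100-dimensional vectors. The following is a minimal sketch of such a pipeline, not the authors' implementation; the toy corpus, window size, SVD-based reduction, and the positive-PMI clipping are illustrative assumptions.

import numpy as np

def build_cooccurrence(sentences, window=2):
    # Count word-context co-occurrences within a symmetric window.
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    M[idx[w], idx[s[j]]] += 1.0
    return vocab, M

def pmi_weight(M):
    # Pointwise mutual information weighting, clipped at zero (PPMI variant).
    total = M.sum()
    row = M.sum(axis=1, keepdims=True)
    col = M.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        P = np.log(M * total / (row * col))
    P[~np.isfinite(P)] = 0.0
    return np.maximum(P, 0.0)

def reduce_dim(P, k=100):
    # Truncated SVD: keep the top-k singular directions as word vectors.
    U, S, _ = np.linalg.svd(P, full_matrices=False)
    k = min(k, len(S))
    return U[:, :k] * S[:k]

def cosine(u, v):
    # Cosine similarity between two word vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Toy usage on a tiny made-up corpus.
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
vocab, counts = build_cooccurrence(corpus, window=2)
vectors = reduce_dim(pmi_weight(counts), k=100)
cat, dog = vocab.index("cat"), vocab.index("dog")
print("sim(cat, dog) =", cosine(vectors[cat], vectors[dog]))

On a real corpus the vocabulary would be filtered by frequency and the reduction kept at 100 dimensions, matching the setting under which the abstract reports count-based models rivaling Skip-gram.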