Abstract
Most word embedding models assign each word only a single vector. The topic of a word is an important cue for resolving polysemy and can serve as additional information for learning multi-prototype word embeddings. Building on the skip-gram (CBOW) model and the topic structure of text, this paper studies two improved multi-prototype word embedding methods and a text generation structure based on word and topic embeddings. Through joint training, the model obtains document topics, word embeddings, and topic embeddings simultaneously, so that topic information is used to derive multi-prototype word vectors while the word and topic embeddings in turn guide topic learning. Experiments show that the proposed methods not only yield context-sensitive multi-prototype word vectors but also produce more coherent topics.
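To make the topic-conditioned embedding idea concrete, the following is a minimal NumPy sketch of one skip-gram-style update in which a center word and its topic jointly predict a context word via negative sampling. The additive composition of word and topic vectors, the sampling scheme, the function name train_pair, and all sizes and hyperparameters are illustrative assumptions for this sketch, not the paper's actual formulation.

import numpy as np

rng = np.random.default_rng(0)

V, K, D = 1000, 10, 50                              # vocabulary size, topic count, embedding dim
word_vecs  = rng.normal(scale=0.1, size=(V, D))     # one input vector per word
topic_vecs = rng.normal(scale=0.1, size=(K, D))     # one vector per topic
ctx_vecs   = rng.normal(scale=0.1, size=(V, D))     # output (context) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center, topic, context, negatives, lr=0.025):
    """One SGD step: a (word, topic) pair predicts a context word.

    The multi-prototype representation of `center` under `topic` is taken
    here as the sum of its word vector and the topic vector; this is an
    assumed composition, not necessarily the one used in the paper.
    """
    h = word_vecs[center] + topic_vecs[topic]        # topic-conditioned hidden vector
    h_grad = np.zeros(D)
    for target, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        score = sigmoid(ctx_vecs[target] @ h)
        g = score - label                            # gradient of the log-loss w.r.t. the score
        h_grad += g * ctx_vecs[target]
        ctx_vecs[target] -= lr * g * h               # update the context (output) vector
    word_vecs[center] -= lr * h_grad                 # both word and topic vectors receive
    topic_vecs[topic] -= lr * h_grad                 # the gradient of the shared hidden vector

# Toy usage: word 3 under topic 2 predicts context word 7,
# with words 11 and 42 drawn as negative samples.
train_pair(center=3, topic=2, context=7, negatives=[11, 42])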
Authors
曹中华
夏家莉
彭文忠
张志斌
CAO Zhonghua; XIA Jiali; PENG Wenzhong; ZHANG Zhibin (School of Information Technology, Big Data Center of Finance, Jiangxi University of Finance and Economics, Nanchang, Jiangxi 330032, China; School of Software, Jiangxi Normal University, Nanchang, Jiangxi 330022, China)
Source
《中文信息学报》
CSCD
Peking University Core Journal (北大核心)
2020, No. 3, pp. 64-71, 106 (9 pages in total)
Journal of Chinese Information Processing
Funding
National Natural Science Foundation of China (41661083).
Keywords
multi-prototype word embedding
polysemous words
topic model
neural network