Abstract
Word embedding models, which build semantic word vectors from a corpus, can quantitatively characterize the contextual semantics of words. However, traditional word embedding models have limitations when representing polysemous words: the dimensionality of the semantic space is uncertain, and its dimensions lack intuitive interpretability. In addition, there is still no effective way to obtain semantic embeddings for out-of-vocabulary (unregistered) words. To address the polysemy and out-of-vocabulary problems, the strengths of word embedding and word co-occurrence can be combined to compensate for these shortcomings of traditional models. The main contributions of this paper are: (1) constructing global corpus word vectors from the trained word embedding matrix and the normalized word co-occurrence matrix; and (2) creating corpus word vectors for out-of-vocabulary words and fusing them, by weight, with the global corpus word vectors to improve the accuracy of the word embeddings. Two experiments on public datasets show that the co-occurrence-based embedding model for polysemous and out-of-vocabulary words effectively improves embedding accuracy and shortens embedding time.
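The two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy embeddings, the window size, the helper names (`cooccurrence`, `corpus_vector`, `fuse`) and the blending weight `alpha` are all assumptions made for illustration, since the abstract does not give the exact formulas. The idea shown is that a word's corpus vector is the co-occurrence-weighted average of its neighbours' trained embeddings, so a word with no trained embedding can still receive a vector.

```python
from collections import Counter, defaultdict


def cooccurrence(sentences, window=2):
    """Count symmetric co-occurrences of words within a fixed window."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][sent[j]] += 1
    return counts


def corpus_vector(word, counts, embeddings):
    """Average the embeddings of a word's in-vocabulary neighbours,
    weighted by its row-normalized co-occurrence counts."""
    dim = len(next(iter(embeddings.values())))
    neighbours = {c: n for c, n in counts[word].items() if c in embeddings}
    total = sum(neighbours.values())
    vec = [0.0] * dim
    if total == 0:
        return vec  # no usable context: fall back to the zero vector
    for c, n in neighbours.items():
        for k in range(dim):
            vec[k] += (n / total) * embeddings[c][k]
    return vec


def fuse(a, b, alpha=0.5):
    """Blend two vectors; alpha is a hypothetical fusion weight."""
    return [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]


# Toy 2-dimensional "trained" embeddings (hypothetical values).
embeddings = {
    "river": [1.0, 0.0],
    "water": [0.8, 0.2],
    "money": [0.0, 1.0],
    "loan": [0.1, 0.9],
    "bank": [0.5, 0.5],
}
sentences = [
    ["river", "bank", "water"],
    ["bank", "loan", "money"],
    ["fintech", "loan", "money"],  # "fintech" has no trained embedding
]

counts = cooccurrence(sentences, window=2)
bank_corpus = corpus_vector("bank", counts, embeddings)
bank_fused = fuse(embeddings["bank"], bank_corpus, alpha=0.5)
fintech_vec = corpus_vector("fintech", counts, embeddings)
```

In this toy corpus, "fintech" co-occurs only with "loan" and "money", so its corpus vector lands near the financial sense, while the polysemous "bank" stays between the two senses. A real system would replace the toy embeddings with a trained matrix and choose `alpha` empirically.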
Authors
LI Bao-zhen (李保珍)
GU Xiu-lian (顾秀莲)
School of Information Engineering, Nanjing Audit University, Nanjing 211815, China
Source
Computer Technology and Development (《计算机技术与发展》)
2022, Issue 12, pp. 117-122 (6 pages)
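Funding
National Natural Science Foundation of China (71673122, 72074117)
Jiangsu Provincial Social Science Fund Project (20WTB007)
Jiangsu Provincial Graduate Research Innovation Project (KYCX21_1948)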
Keywords
word embedding
unknown words
polysemous word
co-occurrence matrix
word vector