
Co-occurrence Word Embedding Improvement for Unknown and Polysemous Words
Abstract: Word embedding models, which build semantic word vectors from a corpus, can quantitatively describe the contextual semantics of words. However, traditional word embedding models have limitations in revealing the semantics of polysemous words, such as uncertain semantic-space vector dimensionality and a lack of intuitive interpretability. In addition, there is still no effective way to derive semantic embeddings for new, out-of-vocabulary (unknown) words. To address polysemy and out-of-vocabulary words, the strengths of word embedding and word co-occurrence can be combined to compensate for the shortcomings of traditional word embedding models: uncertain semantic-space dimensionality, uninterpretable semantic dimensions, and neglect of out-of-vocabulary words. The main contributions of this paper are: constructing global corpus word vectors from the trained word embedding matrix and a word-normalized co-occurrence matrix; and creating corpus word vectors for out-of-vocabulary words and fusing them, with weights, with the global corpus word vectors to improve embedding accuracy. Two experiments on public datasets show that the proposed co-occurrence-based embedding model for polysemous and out-of-vocabulary words can effectively improve the accuracy of word embeddings and shorten embedding processing time.
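The two contributions described in the abstract can be sketched roughly as follows. This is a minimal illustration with hypothetical function names and a simple weighted-average fusion rule, not the authors' actual implementation: a global corpus vector for each word is formed by multiplying its row of the row-normalized co-occurrence matrix into the trained embedding matrix, and an out-of-vocabulary word's vector is built from its observed context counts and fused with the global vectors.

```python
import numpy as np

def global_corpus_vectors(embedding, cooccurrence):
    """Combine a trained embedding matrix (V x d) with a row-normalized
    co-occurrence matrix (V x V): each word's global corpus vector is the
    co-occurrence-weighted average of its context words' embeddings.
    (Hypothetical sketch of the paper's first step.)"""
    row_sums = cooccurrence.sum(axis=1, keepdims=True)
    norm_cooc = cooccurrence / np.maximum(row_sums, 1e-12)  # row-normalize
    return norm_cooc @ embedding  # shape (V, d)

def oov_vector(context_counts, embedding, global_vectors, alpha=0.5):
    """Build a vector for an out-of-vocabulary word from its observed
    in-vocabulary context counts (length V), then fuse the embedding-space
    and global-space context averages with weight alpha.
    (The fusion rule and alpha are illustrative assumptions.)"""
    w = context_counts / max(context_counts.sum(), 1e-12)
    from_embedding = w @ embedding       # context average in embedding space
    from_global = w @ global_vectors     # context average in global space
    return alpha * from_embedding + (1 - alpha) * from_global
```

In this sketch the fusion weight `alpha` trades off the raw embedding signal against the co-occurrence-smoothed global signal; the paper's actual weighting scheme may differ.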
Authors: LI Bao-zhen; GU Xiu-lian (School of Information Engineering, Nanjing Audit University, Nanjing 211815, China)
Source: Computer Technology and Development, 2022, No. 12, pp. 117-122 (6 pages)
Funding: National Natural Science Foundation of China (71673122, 72074117); Jiangsu Provincial Social Science Foundation (20WTB007); Jiangsu Postgraduate Research and Practice Innovation Program (KYCX21_1948)
Keywords: word embedding; unknown words; polysemous words; co-occurrence matrix; word vector
