期刊文献+

细粒度语义知识图谱增强的中文OOV词嵌入学习 被引量:1

Fine-grained Semantic Knowledge Graph Enhanced Chinese OOV Word Embedding Learning
下载PDF
导出
摘要 随着信息化领域的范围不断扩大,许多特定领域的文本语料开始涌现。这些特定领域,如医疗、通信等,由于受到安全性和敏感性的影响,其数据规模通常较小,传统的词嵌入学习模型难以获得有效的结果。另一方面,直接应用现有的预训练语言模型时会出现较多未登录词,这些词汇无法表示成向量,从而影响下游任务的性能表现。许多学者开始研究如何利用细粒度语义信息来得到较高质量的未登录词向量表示。然而,当前的未登录词嵌入学习模型大多针对英文语料,对中文词的细粒度语义信息只能进行简单的拼接或映射,难以在中文未登录词嵌入学习任务中得到有效的向量表示。针对上述问题,首先通过中文构字规则,即中文词所包含的汉字、汉字所包含的部件和拼音等,构建细粒度的知识图谱,使其不仅能涵盖汉字和单词之间的关联关系,还能对拼音和汉字、组件和汉字等细粒度语义信息之间的多元且复杂的关联关系进行表征。然后,在知识图谱上运行图卷积算法,从而对中文词的细粒度语义信息之间以及它们与词语义之间更深层次的关系进行建模。此外,文中通过在子图结构上构建图读出来进一步挖掘细粒度语义信息与词语义信息之间的组成关系,据此提升模型在未登录词嵌入推断中的精准度。实验结果表明,在面对未登录词占比较大的特定语料上的词配对、词相似任务,以及文本分类、命名实体识别等下游任务时,所提模型都取得了更好的性能。 With the expansion of the scope in informatization fields,lots of text corpora in specific fields continue to appear.Due to the impact of security and sensitivity,the text corpora in these specific fields(e.g.,medical records corpora and communication corpora)are often small-scaled.It is difficult for traditional word embedding learning methods to obtain high-quality embeddings on these corpora.On the other hand,there may exist many out-of-vocabulary words in these corpora when using the existing pre-training language models directly,for which,many words cannot be represented as vectors and the performance on downstream tasks are limited.Many researchers start to study how to infer the semantics of out-of-vocabulary words and obtain effective out-of-vocabulary word embeddings based on fine-grained semantic information.However,the current models utilizing fine-grained semantic information mainly focus on the English corpora and they only model the relationship among fine-grained semantic information by simple ways of concatenation or mapping,which leads to a poor model robustness.Aiming at addressing the above problems,this paper first proposes to construct a fine-grained knowledge graph by exploiting Chinese word formation rules,such as the characters contained in Chinese words,as well as the character components and pinyin of Chinese characters.The know-ledge graph not only captures the relationship between Chinese characters and Chinese words,but also represents the multiple and complex relationships between Pinyin and Chinese characters,components and Chinese characters,and other fine-grained semantic information.Next,the relational graph convolution operation is performed on the knowledge graph to model the deeper relationship between fine-grained semantics and word semantics.The method further mines the relationship between fine-grained semantics by the sub-graph readout,so as to effectively infer the semantic information of Chinese out-of-vocabulary words.Experimental results show that our model achieves better performance on specific corpora with a large proportion of out-of-vocabulary words when applying to tasks such as word analogy,word similarity,text classification,and named entity recognition.
作者 陈姝睿 梁子然 饶洋辉 CHEN Shurui;LIANG Ziran;RAO Yanghui(School of Computer Science and Engineering,Sun Yat-sen University,Guangzhou 510006,China)
出处 《计算机科学》 CSCD 北大核心 2023年第3期72-82,共11页 Computer Science
基金 国家自然科学基金面上项目(61972426)。
关键词 未登录词嵌入学习 中文细粒度语义信息 细粒度知识图谱 图卷积网络学习 Out-of-vocabulary word embedding learning Chinese fine-grained semantic information Fine-grained knowledge graph Graph convolution network learning
  • 相关文献

参考文献5

二级参考文献23

共引文献13

同被引文献4

引证文献1

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部