Chinese Notional Word Discrimination Based on RoBERTa-ND

Abstract: Chinese notional words are combinatorial and metaphorical in nature, and no dataset has been available for Chinese notional word discrimination. As a result, traditional methods remain limited in how well they understand and discriminate Chinese notional words in machine reading comprehension tasks. To address this, a large-scale (600k) Chinese notional word discrimination cloze dataset (CND) is constructed. In the dataset, one notional word in a sentence is replaced with a blank placeholder, and the correct answer must be selected from two candidate notional words. A baseline model, the RoBERTa-based notional word discrimination model (RoBERTa-ND), is designed to choose between the candidates. The model first uses a pre-trained language model to extract semantic information from the context. Second, it fuses the semantics of the candidate notional words and computes a score for each candidate through a classification task. Finally, it strengthens its discrimination ability by enhancing its perception of position and direction information. Experiments show that the model reaches 90.21% accuracy on CND, outperforming mainstream cloze models such as DUMA (87.59%) and GNN-QA (84.23%). This work fills a gap in research on Chinese metaphorical semantic understanding and has practical value for directions such as improving the cognitive ability of Chinese dialogue bots. The CND dataset and the RoBERTa-ND code are open source: https://github.com/2572926348/CND-Large-scale-Chinese-National-word-discrimination-dataset.
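The task format described in the abstract (one blanked sentence, two candidates, one correct answer, accuracy as the metric) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the `[BLANK]` marker, the `ClozeItem` fields, the two example sentences, and `toy_scorer` are all assumptions made here for illustration. In particular, `toy_scorer` is a hand-written keyword lookup standing in for a learned model such as RoBERTa-ND, which would score each candidate from contextual representations instead.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical placeholder token; the actual marker used in CND may differ.
BLANK = "[BLANK]"


@dataclass(frozen=True)
class ClozeItem:
    sentence: str                # sentence with one notional word replaced by BLANK
    candidates: Tuple[str, str]  # the two candidate notional words
    answer: int                  # index (0 or 1) of the correct candidate


# A scorer maps (sentence, candidate) to a real-valued score.
Scorer = Callable[[str, str], float]


def predict(item: ClozeItem, scorer: Scorer) -> int:
    """Return the index of the highest-scoring candidate (first one on ties)."""
    scores = [scorer(item.sentence, cand) for cand in item.candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(items: Sequence[ClozeItem], scorer: Scorer) -> float:
    """Fraction of items for which the predicted candidate is the answer."""
    if not items:
        raise ValueError("items must not be empty")
    correct = sum(predict(item, scorer) == item.answer for item in items)
    return correct / len(items)


# Toy stand-in for a learned model (assumption for illustration only):
# each candidate is associated with one context keyword, and the score is
# 1.0 when that keyword occurs in the sentence, else 0.0.
_TOY_KEYWORDS = {"引起": "共鸣", "引导": "方向", "发扬": "传统", "发挥": "作用"}


def toy_scorer(sentence: str, candidate: str) -> float:
    keyword = _TOY_KEYWORDS.get(candidate)
    return 1.0 if keyword is not None and keyword in sentence else 0.0


# Made-up example items, not taken from CND.
items = [
    ClozeItem(f"这部电影{BLANK}了观众的强烈共鸣", ("引起", "引导"), 0),
    ClozeItem(f"我们要{BLANK}传统文化", ("发挥", "发扬"), 1),
]

if __name__ == "__main__":
    print(accuracy(items, toy_scorer))
```

Keeping the scorer behind a plain `(sentence, candidate) -> float` interface means the same `accuracy` function could in principle be used to compare different models on the same items.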

Authors: SUN Chen-Yu; WANG Zhen-Qi; ZHANG Bao-Yu; ZHANG Wei-Shan; HOU Zhao-Xiang; CHEN Tao (College of Computer Science and Technology, China University of Petroleum, Qingdao 266580, China)

Affiliation: College of Computer Science and Technology, China University of Petroleum (East China)

Source: Computer Systems & Applications, 2023, No. 5, pp. 157-163 (7 pages)

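Funding: National Natural Science Foundation of China (62072469); 2021 Open Project of the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences (20210114).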

Keywords: metaphorical semantic understanding; Chinese notional word discrimination; machine reading comprehension

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]