Abstract
Automatic proofreading of Chinese text is one of the main tasks in natural language processing. To address character-level errors in Chinese text (substitutions by phonetically similar, visually similar, or semantically similar characters), this paper proposes a Chinese text proofreading model based on RoBERTa-wwm-ext and a confusion set. Building on the RoBERTa-wwm-ext architecture, the model uses the Transformer encoder to read the entire Chinese text sequence, then applies the softmax function to compute a weight distribution for the current character and decide whether that character is wrong. For the correction task, a confusion set is introduced to retrieve the candidate characters that correspond to the erroneous character; finally, these candidates are combined with the suggestions given by the masked language model to complete the proofreading. Character-level Chinese text proofreading experiments were designed on the SIGHAN2014 and SIGHAN2015 Chinese spelling check datasets to compare model performance. The experimental results show that, compared with current mainstream Chinese text proofreading models, the proposed model proofreads Chinese text more effectively, with improvements in accuracy, recall, and F1 score.
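The detect-then-correct flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `score_fn` is a hypothetical stand-in for the masked language model (here a hand-written toy), and the `threshold` value, the function names, and the tiny confusion set are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Turn a dict of raw scores into a probability distribution."""
    peak = max(scores.values())
    exps = {ch: math.exp(s - peak) for ch, s in scores.items()}
    total = sum(exps.values())
    return {ch: e / total for ch, e in exps.items()}


def proofread(text, score_fn, confusion_set, threshold=0.1):
    """Flag low-probability characters, then correct them via the confusion set.

    score_fn(text, i) is a hypothetical stand-in for the masked language
    model: it returns raw scores over candidate characters at position i.
    The threshold is an assumed value, not one taken from the paper.
    """
    chars = list(text)
    for i, ch in enumerate(text):
        probs = softmax(score_fn(text, i))
        if probs.get(ch, 0.0) >= threshold:
            continue  # the character looks plausible in context
        candidates = confusion_set.get(ch, set())
        if not candidates:
            continue  # nothing to substitute, leave the character alone
        # Only confusion-set members are considered, ranked by model probability.
        best = max(candidates, key=lambda c: probs.get(c, 0.0))
        if probs.get(best, 0.0) > probs.get(ch, 0.0):
            chars[i] = best
    return "".join(chars)


def toy_scores(text, i):
    """Hand-written scores for one position; a real system would query the model."""
    if text == "我喜欢吃平果" and i == 4:
        return {"苹": 5.0, "平": 0.5, "评": 0.2}
    return {text[i]: 5.0}


confusion = {"平": {"苹", "评", "凭"}}
print(proofread("我喜欢吃平果", toy_scores, confusion))
```

Restricting the candidates to the confusion set keeps the correction close to the original character in sound or shape, while the model probabilities decide which of those candidates fits the context.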
Authors
徐久珺
黄国栋
马传香
XU Jiujun; HUANG Guodong; MA Chuanxiang (School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China; The Key Research Institute of Humanities and Social Sciences in Hubei Province (Research Center of Information Management for Performance Evaluation), Wuhan 430062, China)
Source
《湖北大学学报(自然科学版)》
CAS
2023, No. 5, pp. 712-718 (7 pages)
Journal of Hubei University: Natural Science
Funding
Supported by the National Natural Science Foundation of China (62102136).