Chinese Text Proofreading Model Based on RoBERTa-wwm-ext and a Confusion Set
Abstract: Automatic proofreading of Chinese text is one of the main tasks in natural language processing. To address character-level errors in Chinese text (substitutions by phonetically similar, visually similar, or semantically similar characters), this paper proposes a Chinese text proofreading model based on RoBERTa-wwm-ext and a confusion set. Building on the RoBERTa-wwm-ext architecture, the model uses the transformer encoder to read the entire Chinese text sequence, then applies the softmax function to compute a weight distribution for the current character and judge whether that character is wrong. For the correction task, it introduces a confusion set to look up the candidate characters for the erroneous character, and finally combines them with the suggestions given by the masked language model to complete the proofreading. Character-level proofreading experiments were designed on the SIGHAN2014 and SIGHAN2015 Chinese spelling check datasets to compare model performance. The results show that, compared with current mainstream Chinese text proofreading models, the proposed model proofreads more effectively, with improvements in accuracy, recall, and F1 score.
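The detect-then-correct flow described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `toy_logits` is a hand-written stand-in for the RoBERTa-wwm-ext encoder and its masked-language-model head, `CONFUSION_SET` holds two invented entries, and the detection `threshold` of 0.1 is an arbitrary choice. The sketch only shows how a softmax distribution, a confusion-set lookup, and masked-language-model scores might be combined.

```python
import math

# Hypothetical confusion set: each character maps to characters it is easily mistaken for.
CONFUSION_SET = {
    "在": ["再"],
    "再": ["在"],
}

VOCAB = ["我", "在", "再", "家"]


def softmax(scores):
    """Turn a dict of raw scores into a probability distribution."""
    top = max(scores.values())
    exps = {ch: math.exp(s - top) for ch, s in scores.items()}
    total = sum(exps.values())
    return {ch: e / total for ch, e in exps.items()}


def toy_logits(sentence, i):
    """Stand-in for the encoder + masked language model (hand-written scores)."""
    left = sentence[i - 1] if i > 0 else ""
    right = sentence[i + 1] if i + 1 < len(sentence) else ""
    if left == "我" and right == "家":
        # Between these two characters the toy model strongly prefers "在".
        return {"我": 0.0, "在": 5.0, "再": 1.0, "家": 0.0}
    # Otherwise the toy model simply trusts the character that is already there.
    return {ch: (5.0 if ch == sentence[i] else 0.0) for ch in VOCAB}


def proofread(sentence, logits_fn=toy_logits, threshold=0.1):
    """Flag low-probability characters, then correct them from the confusion set."""
    corrected = list(sentence)
    for i, ch in enumerate(sentence):
        probs = softmax(logits_fn(sentence, i))
        if probs.get(ch, 0.0) >= threshold:
            continue  # detection step: the character looks plausible in context
        candidates = CONFUSION_SET.get(ch, [])
        if not candidates:
            continue  # no confusion-set entry, so leave the character unchanged
        # Correction step: choose the confusion-set candidate the model scores highest.
        best = max(candidates, key=lambda c: probs.get(c, 0.0))
        if probs.get(best, 0.0) > probs.get(ch, 0.0):
            corrected[i] = best
    return "".join(corrected)


print(proofread("我再家"))
```

Restricting replacements to confusion-set candidates keeps the masked language model from swapping in a fluent but unrelated character. In a real system the toy scorer would be replaced by the pretrained model's output distribution.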
Authors: XU Jiujun; HUANG Guodong; MA Chuanxiang (School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China; The Key Research Institute of Humanities and Social Sciences in Hubei Province (Research Center of Information Management for Performance Evaluation), Wuhan 430062, China)
Source: Journal of Hubei University (Natural Science), CAS, 2023, No. 5, pp. 712-718 (7 pages)
Funding: Supported by the National Natural Science Foundation of China (62102136).
Keywords: natural language processing; masked language model; RoBERTa-wwm-ext; confusion set; transformer architecture