Chinese Text Proofreading Model Based on RoBERTa-wwm-ext and a Confusion Set


Abstract: Automatic proofreading of Chinese text is one of the main tasks in natural language processing. To address character-level errors in Chinese text (substitutions of phonetically similar, visually similar, and semantically similar characters), this paper proposes a Chinese text proofreading model based on RoBERTa-wwm-ext and a confusion set. Building on the RoBERTa-wwm-ext architecture, the model uses the transformer encoder to read the entire Chinese text sequence, then applies the softmax function to compute a weight distribution for the current character and decide whether that character is erroneous. For the correction task, a confusion set is introduced to find the candidate characters for each erroneous character, and these candidates are combined with the suggestions of the masked language model to complete the proofreading. Character-level proofreading experiments were designed on the SIGHAN2014 and SIGHAN2015 Chinese spelling check datasets to compare model performance. The results show that, compared with current mainstream Chinese text proofreading models, the proposed model performs better, with improvements in accuracy, recall, and F1 score.
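The detect-then-correct flow summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_scorer` is a hypothetical stand-in for the RoBERTa-wwm-ext encoder and masked-language-model head, and the `threshold` value, the confusion set entries, and all function names are assumptions made purely for illustration.

```python
import math
from typing import Callable, Dict, Set

# Hypothetical interface: given a sentence and a position, return raw scores
# over characters for that position (stand-in for encoder + MLM head).
Scorer = Callable[[str, int], Dict[str, float]]


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn raw per-character scores into a probability distribution."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {ch: math.exp(s - top) for ch, s in scores.items()}
    total = sum(exps.values())
    return {ch: e / total for ch, e in exps.items()}


def proofread(text: str, scorer: Scorer,
              confusion_set: Dict[str, Set[str]],
              threshold: float = 0.1) -> str:
    """Flag low-probability characters, then pick the best confusion-set candidate."""
    chars = list(text)
    for i, ch in enumerate(text):
        probs = softmax(scorer(text, i))
        observed_p = probs.get(ch, 0.0)
        if observed_p >= threshold:
            continue  # detection step: character is treated as correct
        # Correction step: restrict candidates to the confusion set of this character.
        scored = [(probs.get(c, 0.0), c) for c in confusion_set.get(ch, set())]
        if not scored:
            continue  # no known confusable characters, leave unchanged
        best_p, best = max(scored)
        if best_p > observed_p:
            chars[i] = best
    return "".join(chars)


def toy_scorer(text: str, i: int) -> Dict[str, float]:
    """Toy scorer (assumption): strongly prefers 们 right after 我."""
    if i > 0 and text[i - 1] == "我":
        return {"们": 5.0, "门": 0.5, "闷": 0.1}
    return {text[i]: 5.0}


# Illustrative confusion set entry; real confusion sets are much larger.
CONFUSION = {"门": {"们", "闷"}}

corrected = proofread("我门去公园", toy_scorer, CONFUSION)
print(corrected)
```

Restricting the replacement to confusion-set candidates, rather than taking the language model's top prediction directly, keeps corrections limited to characters that are plausibly confusable with the original one.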

Authors: XU Jiujun; HUANG Guodong; MA Chuanxiang

Affiliations: School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China; Key Research Institute of Humanities and Social Sciences in Hubei Province (Research Center of Information Management for Performance Evaluation), Wuhan 430062, China

Source: Journal of Hubei University (Natural Science) [CAS], 2023, No. 5, pp. 712-718 (7 pages)

Funding: Supported by the National Natural Science Foundation of China (62102136).

Keywords: natural language processing; mask language model; RoBERTa-wwm-ext; confusion set; transformer structure

Classification number: TB324.1 [General Industrial Technology: Materials Science and Engineering]


