
Chinese Spelling Correction Method Based on ChineseBert
Abstract: Chinese spelling errors are mainly concentrated in two areas: phonetic (pinyin) similarity and glyph similarity. General pretrained language models consider only the semantic information of the text, ignoring the phonetic and glyph features of Chinese characters. The latest Chinese Spelling Correction (CSC) methods incorporate pinyin and glyph features through additional networks built on top of pretrained language models, but compared with directly fine-tuning the pretrained model, these improved models do not significantly raise performance: the pinyin and glyph features, trained only on small-scale spelling-task corpora, suffer from a serious information asymmetry relative to the rich semantic features acquired during pretraining. To resolve this asymmetry, this paper applies the multimodal pretrained language model ChineseBert to the CSC problem. Because ChineseBert already incorporates pinyin and glyph information at the pretraining stage, a CSC method based on ChineseBert requires no additional networks and eliminates the information asymmetry. Since CSC methods based on pretrained models generally handle consecutive errors poorly, a further method, SepSpell, is proposed. SepSpell first uses a probing network to detect potentially erroneous characters; for those characters it retains the pinyin and glyph features while masking the corresponding semantic information during prediction. This reduces the interference that erroneous characters introduce into the prediction process and handles consecutive errors better. Both proposed methods achieve very good results on three official evaluation datasets.
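The SepSpell strategy described above can be illustrated with a minimal sketch. All names, the threshold value, and the feature representations below are assumptions for illustration, not the authors' actual implementation: ChineseBert fuses semantic, pinyin, and glyph channels per character, and SepSpell masks only the semantic channel of characters that the probing (detection) network flags, so the corrector still sees sound and shape evidence without the misleading wrong character.

```python
MASK = "[MASK]"
THRESHOLD = 0.5  # assumed detection threshold, not from the paper

def sepspell_mask(chars, pinyins, glyphs, error_probs, threshold=THRESHOLD):
    """Build per-position (semantic, pinyin, glyph) inputs for correction.

    chars / pinyins / glyphs: parallel lists of per-character features.
    error_probs: probing-network scores in [0, 1], one per character.
    """
    fused = []
    for ch, py, gl, p in zip(chars, pinyins, glyphs, error_probs):
        semantic = MASK if p >= threshold else ch  # mask only suspected errors
        fused.append((semantic, py, gl))           # pinyin/glyph always kept
    return fused

# Toy example: "天气" (weather) miswritten as "天汽"; both 气 and 汽
# share the pinyin "qi4", so the kept pinyin channel still guides correction.
inputs = sepspell_mask(
    chars=["天", "汽"],
    pinyins=["tian1", "qi4"],
    glyphs=["glyph(天)", "glyph(汽)"],
    error_probs=[0.05, 0.92],  # probing network flags the 2nd character
)
```

Because only the flagged positions lose their semantic channel, a run of consecutive errors no longer pollutes the context used to predict each corrected character.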
Authors: Cui Fan, Qiang Jipeng, Zhu Yi, Li Yun (School of Information Engineering, Yangzhou University, Yangzhou 225127, China)
Source: Journal of Nanjing University (Natural Science), 2023, No. 2, pp. 302-312 (11 pages); indexed in CAS, CSCD, and the Peking University Core list.
Funding: National Natural Science Foundation of China (62076217, 61906060); Qinglan Project of Yangzhou University.
Keywords: Chinese spelling correction; BERT; ChineseBert; multimodal pretrained language model