Abstract
Chinese spelling errors arise mainly from phonetic similarity and glyph similarity, yet general-purpose pretrained language models consider only the semantic information of text and ignore the pinyin and glyph features of Chinese characters. Recent Chinese Spelling Correction (CSC) methods add extra networks on top of a pretrained model to incorporate pinyin and glyph features, but compared with directly fine-tuning the pretrained model, these modified models do not significantly improve performance. The reason is a serious information asymmetry: the pinyin and glyph features are trained on a small-scale spelling-task corpus, whereas the pretrained model supplies rich semantic features. To address this asymmetry, this paper applies the multimodal pretrained language model ChineseBert to the CSC problem. Because ChineseBert already incorporates pinyin and glyph information at the pretraining stage, a ChineseBert-based CSC method needs no additional networks and also resolves the information asymmetry. CSC methods based on pretrained models generally handle consecutive errors poorly, so the paper further proposes SepSpell. SepSpell first uses a probing network to detect potentially erroneous characters; for those characters it keeps the pinyin and glyph features and masks the corresponding semantic information before prediction. This reduces the interference that erroneous characters cause during prediction and handles consecutive errors better. Evaluated on three official benchmark datasets, both proposed methods achieve strong results.
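The detect-then-mask step described for SepSpell can be illustrated with a minimal sketch. It is not the authors' implementation: the names `CharInput`, `build_inputs`, the `threshold` parameter, and the hand-written detection probabilities are all hypothetical, and the glyph channel is represented by the raw character as a stand-in for a real glyph embedding. The sketch only shows how a flagged character's semantic channel could be masked while its pinyin and glyph channels are kept.

```python
from dataclasses import dataclass
from typing import List

MASK = "[MASK]"  # placeholder for the masked semantic channel (assumed token name)


@dataclass
class CharInput:
    token: str   # semantic channel: the character itself, or MASK if flagged
    pinyin: str  # phonetic channel: always kept
    glyph: str   # glyph channel: always kept (raw character used as a stand-in)


def build_inputs(
    chars: List[str],
    pinyins: List[str],
    error_probs: List[float],
    threshold: float = 0.5,  # hypothetical cut-off for the detector's output
) -> List[CharInput]:
    """Mask the semantic channel of characters the detector flags as likely errors."""
    if not (len(chars) == len(pinyins) == len(error_probs)):
        raise ValueError("chars, pinyins and error_probs must have equal length")
    inputs = []
    for ch, py, prob in zip(chars, pinyins, error_probs):
        # Only the semantic token is hidden; pinyin and glyph stay visible
        # so the corrector can still use them to choose a similar character.
        token = MASK if prob >= threshold else ch
        inputs.append(CharInput(token=token, pinyin=py, glyph=ch))
    return inputs


# Toy sentence with one misspelled character ("平" should be "苹").
chars = list("我喜欢吃平果")
pinyins = ["wo", "xi", "huan", "chi", "ping", "guo"]
error_probs = [0.01, 0.02, 0.01, 0.03, 0.93, 0.10]  # made-up detector scores

for item in build_inputs(chars, pinyins, error_probs):
    print(item)
```

In this toy input only the fifth character is flagged, so the corrector would see `[MASK]` in the semantic channel at that position together with the pinyin `ping` and the glyph of `平`. In a run of consecutive errors, every flagged position would be masked the same way, so no erroneous character contributes semantic context to its neighbours.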
Authors
Cui Fan (崔凡)
Qiang Jipeng (强继朋)
Zhu Yi (朱毅)
Li Yun (李云)
School of Information Engineering, Yangzhou University, Yangzhou 225127, China
Source
Journal of Nanjing University (Natural Science) (《南京大学学报(自然科学版)》)
Indexed in: CAS, CSCD, Peking University Core Journals (北大核心)
2023, Issue 2, pp. 302-312 (11 pages)
Funding
National Natural Science Foundation of China (62076217, 61906060)
"Qinglan Project" (青蓝工程) of Yangzhou University