
Incorporating Confusion Set Knowledge with Pointer Network for Chinese Grammatical Error Correction
Abstract: In Chinese Grammatical Error Correction (CGEC), substitution errors account for the largest share of errors in the data sets, yet no prior work has incorporated phonological and visual similarity knowledge into neural network-based GEC models. This paper addresses the gap in two ways. First, it proposes a GEC model that incorporates confusion set knowledge through a pointer network. Specifically, the model builds on a sequence-to-edit (Seq2Edit) GEC model and uses the pointer network to incorporate phonological and visual similarity knowledge between Chinese characters. Second, in the training data preprocessing stage, i.e., when extracting edit sequences from erroneous-correct sentence pairs, it proposes a confusion-set-guided edit distance algorithm that better extracts substitution edits between phonologically and visually similar characters. Experimental results show that both improvements raise model performance and that their contributions are complementary; the proposed model achieves the current state-of-the-art results on the NLPCC 2018 evaluation data set. Further analysis shows that, compared with the baseline Seq2Edit GEC model, most of the performance gain comes from the correction of substitution errors.
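The abstract does not give the details of the confusion-set-guided edit distance, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a standard Levenshtein alignment in which a substitution between two characters of the same confusion set costs less than an ordinary substitution, so that ties between alignments are resolved in favor of phonologically or visually similar pairs. The function names, the toy confusion set, and the cost value 0.5 are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Set, Tuple

Edit = Tuple[str, str, str]  # (operation, source char, target char)

# Hypothetical toy confusion set; real resources are far larger.
TOY_CONFUSION: Dict[str, Set[str]] = {"以": {"已"}, "已": {"以"}}


def sub_cost(a: str, b: str, confusion: Dict[str, Set[str]]) -> float:
    """Substitution cost; the 0.5 discount is an assumed value."""
    if a == b:
        return 0.0
    if b in confusion.get(a, set()) or a in confusion.get(b, set()):
        return 0.5  # assumed cheaper cost for confusable characters
    return 1.0


def extract_edits(src: str, tgt: str,
                  confusion: Dict[str, Set[str]]) -> List[Edit]:
    """Align src to tgt with weighted edit distance and return the edits."""
    n, m = len(src), len(tgt)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = float(i)
    for j in range(1, m + 1):
        dp[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1.0,  # delete src[i-1]
                dp[i][j - 1] + 1.0,  # insert tgt[j-1]
                dp[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1], confusion),
            )
    # Backtrace from the end, preferring keep/substitute, then delete.
    edits: List[Edit] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and dp[i][j] ==
                dp[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1], confusion)):
            op = "KEEP" if src[i - 1] == tgt[j - 1] else "SUB"
            edits.append((op, src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1.0:
            edits.append(("DEL", src[i - 1], ""))
            i -= 1
        else:
            edits.append(("INS", "", tgt[j - 1]))
            j -= 1
    edits.reverse()
    return edits


guided = extract_edits("他以走了", "他已经走了", TOY_CONFUSION)
plain = extract_edits("他以走了", "他已经走了", {})
```

In this toy pair, plain edit distance has two equally cheap alignments and the backtrace happens to pick the substitution 以→经 plus an insertion of 已, whereas the discounted cost makes the alignment with the substitution 以→已 and an insertion of 经 strictly cheaper. That is the kind of substitution edit between similar characters that the abstract says the preprocessing step aims to extract.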
Authors: LI Jiacheng (李嘉诚); SHEN Jiayu (沈嘉钰); GONG Chen (龚晨); LI Zhenghua (李正华); ZHANG Min (张民) (School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China)
Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2022, No. 4, pp. 29-38 (10 pages)
Funding: National Natural Science Foundation of China (62176173, 61876116).
Keywords: grammatical error correction; confusion set; pointer network
