Abstract
Deep learning has made great progress in the field of machine translation owing to its deep understanding of semantics. However, for low-resource languages, the lack of large-scale bilingual corpora easily leads to model overfitting. To address the data-sparsity problem in low-resource neural machine translation, we propose a dual-learning training method with iterative knowledge refining. Back translation is used to expand the bilingual parallel corpus, and the proportion of pseudo corpus to real corpus is adjusted iteratively, reducing noise risk while the model learns language representations. Finally, rewards for translation quality and fluency are combined to optimize model parameters in both the source-to-target and target-to-source directions, thereby improving translation quality. We conducted a series of experiments on the Mongolian-Chinese translation task of the 15th China Conference on Machine Translation (CCMT 2019). Results show that the proposed method achieves a significant improvement over the baselines, fully demonstrating its effectiveness.
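The abstract's training loop rests on two mechanisms: mixing back-translated (pseudo) pairs with real pairs at an iteratively shrinking ratio, and a reward that combines translation quality with target-side fluency for both translation directions. A minimal sketch of these two pieces is below; all names, the decay schedule, and the reward weight `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
import random

def mix_corpus(real, pseudo, pseudo_ratio):
    """Build one training round: all real pairs plus a sampled
    fraction of back-translated (pseudo) pairs."""
    n_pseudo = int(len(real) * pseudo_ratio)
    return real + random.sample(pseudo, min(n_pseudo, len(pseudo)))

def dual_reward(quality, fluency, alpha=0.7):
    """Weighted reward over translation quality (e.g. a BLEU-like
    score) and fluency (e.g. a language-model score)."""
    return alpha * quality + (1 - alpha) * fluency

# Toy corpora: 100 real pairs, 300 back-translated pairs.
real = [("src%d" % i, "tgt%d" % i) for i in range(100)]
pseudo = [("bt_src%d" % i, "tgt%d" % i) for i in range(300)]

# Iteratively shrink the pseudo-data proportion to lower noise risk.
for ratio in [1.0, 0.5, 0.25]:
    batch = mix_corpus(real, pseudo, ratio)
    # Here one would update the forward (src->tgt) and backward
    # (tgt->src) models on `batch` using dual_reward as the training
    # signal, then regenerate `pseudo` with the refreshed backward model.
```

In dual learning, the backward model both scores round-trip reconstructions and produces the next round of pseudo data, so the shrinking ratio gradually shifts weight from synthetic to genuine supervision.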
Authors
孙硕 (SUN Shuo)
侯宏旭 (HOU Hongxu)
乌尼尔 (WU Nier)
常鑫 (CHANG Xin)
贾晓宁 (JIA Xiaoning)
李浩然 (LI Haoran)
College of Computer Science, Inner Mongolia University, Hohhot 010020, China
Source
Journal of Xiamen University (Natural Science)
Indexed in: CAS; CSCD; Peking University Core Journals
2021, No. 4, pp. 687-692 (6 pages)
Funding
Inner Mongolia Autonomous Region Science and Technology Achievement Transformation Project (2019CG028)
Inner Mongolia Autonomous Region Natural Science Foundation (2018MS06005)
Keywords
neural machine translation
low-resource language
dual learning
back translation
knowledge refining