Abstract
Neural machine translation achieves good performance on high-resource language pairs, but this performance typically depends on a large-scale parallel corpus. Given that only small-scale bilingual parallel sentence pairs exist between minority languages and Chinese, this paper proposes integrating data augmentation into a multi-task learning framework to improve translation performance. First, simple transformations (such as word-order adjustment and word substitution) are applied to the target-side sentences to produce inexact new sentence pairs as added noise. Second, the augmented pseudo-parallel corpus is introduced as auxiliary tasks into a multi-task learning framework to fully train the encoder, directing the network's attention toward producing richer and more accurate representations of the source-language sentences in the encoder. Experiments in six translation directions on the CCMT 2021 Mongolian-Chinese, Tibetan-Chinese, and Uyghur-Chinese evaluation datasets show that the proposed method significantly outperforms both the baseline system and several common data augmentation methods for machine translation.
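The target-side perturbations described in the abstract (word-order adjustment and word substitution) can be sketched as follows. This is a minimal illustration of the general technique, not the paper's actual implementation; the function name, probabilities, and toy vocabulary are assumptions for the example.

```python
import random

def augment_target(tokens, swap_prob=0.1, sub_prob=0.1, vocab=None, rng=None):
    """Produce a noisy copy of a target-side sentence by randomly
    swapping adjacent tokens (word-order adjustment) and replacing
    tokens with random vocabulary items (word substitution)."""
    rng = rng or random.Random(0)
    out = list(tokens)
    # Word-order adjustment: swap some adjacent token pairs.
    for i in range(len(out) - 1):
        if rng.random() < swap_prob:
            out[i], out[i + 1] = out[i + 1], out[i]
    # Word substitution: replace some tokens with random vocabulary words.
    if vocab:
        for i in range(len(out)):
            if rng.random() < sub_prob:
                out[i] = rng.choice(vocab)
    return out

# Pairing the original source sentence with such a perturbed target
# yields an inexact pseudo-parallel pair for the auxiliary task.
target = "他 昨天 去 了 北京".split()
noisy_target = augment_target(target, vocab=["上海", "今天", "学校"])
```

Each noisy pair keeps the source sentence unchanged, so the auxiliary task still forces the encoder to build a faithful source-side representation even though the target supervision is imperfect.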
Authors
SHEN Yingli, ZHOU Maoke, ZHAO Xiaobing
(School of Chinese Ethnic Minority Languages and Literatures, Minzu University of China, Beijing 100081, China; National Language Resource Monitoring and Research Center of Minority Languages, Beijing 100081, China; School of Information Engineering, Minzu University of China, Beijing 100081, China)
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
PKU Core Journal (北大核心)
2023, Issue 2, pp. 97-106 (10 pages)
Funding
Key Project of the State Language Commission (ZDI135-118)
Special Project for National Security Research of Minzu University of China (2022GJAQ03)
Graduate Research and Practice Project of Minzu University of China (BZKY2021062)
Keywords
multi-task learning
data augmentation
low-resource machine translation