Abstract
Back translation, an important data augmentation method in machine translation, has attracted growing attention from researchers. The basic idea is to first train a base translation model on a parallel corpus, then use that model to translate a monolingual corpus into the target language, and combine the result into a new corpus for model training. In the low-resource Chinese-Vietnamese setting, however, the base translation model performs poorly, so the pseudo-parallel corpus produced by applying back translation to it contains considerable noise and is hard to use for downstream tasks. To address this problem, a Siamese network filtering model based on proportional extraction is constructed. Through training, the model learns to distinguish parallel sentence pairs from pseudo-parallel sentence pairs; it then filters and denoises the back-translated pseudo-parallel corpus in a shared semantic space, yielding a higher-quality parallel corpus. Experimental results on a Chinese-Vietnamese dataset show that a model trained with the proposed method significantly outperforms the baseline model.
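The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the abstract does not specify the network architecture, the training objective, or how proportional extraction works, so none of those are reproduced here. The sketch assumes a shared encoder (the hypothetical `encode` argument, standing in for the two weight-tied branches of a Siamese network) and substitutes a plain cosine-similarity threshold for the trained classifier. The function names, the threshold value, and the toy vectors are all illustrative assumptions.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_pseudo_parallel(
    pairs: Iterable[Pair],
    encode: Callable[[str], Vector],
    threshold: float = 0.5,
) -> List[Pair]:
    """Keep pairs whose two sides lie close together in the shared space.

    `encode` is a hypothetical shared encoder applied to both sentences;
    `threshold` is an assumed cut-off, not a value from the paper.
    """
    return [
        (src, tgt)
        for src, tgt in pairs
        if cosine(encode(src), encode(tgt)) >= threshold
    ]


# Toy stand-in for a trained encoder: a hand-written lookup table.
toy_vectors = {
    "我喜欢猫": (1.0, 0.0),
    "Tôi thích mèo": (0.9, 0.1),
    "Hôm nay trời mưa": (0.0, 1.0),
}

candidates = [
    ("我喜欢猫", "Tôi thích mèo"),      # plausible translation pair
    ("我喜欢猫", "Hôm nay trời mưa"),   # noisy back-translation output
]

kept = filter_pseudo_parallel(candidates, toy_vectors.__getitem__, threshold=0.8)
print(kept)
```

In a real setting the lookup table would be replaced by a trained bilingual sentence encoder, and the fixed threshold by whatever decision rule the filtering model learns.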
Authors
王可超
郭军军
张亚飞
高盛祥
余正涛
WANG Ke-chao; GUO Jun-jun; ZHANG Ya-fei; GAO Sheng-xiang; YU Zheng-tao (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China)
Source
《计算机工程与科学》
CSCD
Peking University Core Journals
2022, No. 10, pp. 1861-1868 (8 pages)
Computer Engineering & Science
Funding
National Natural Science Foundation of China (61732005, 61761026, 61866020, 61672271, 61762056, 61972186)
National Key Research and Development Program of China (2019QY1801, 2019QY1802, 2019QY1800)
Keywords
Chinese-Vietnamese parallel corpus expansion
back translation
data augmentation
proportional extraction
siamese network