Dictionary-Based Data Augmentation Method for Low-Resource Neural Machine Translation
Abstract: Data augmentation is an effective way to improve neural machine translation (NMT) for low-resource languages. Traditional back-translation makes effective use of target-language monolingual data for training, but the quality of the back-translation model depends on the size of the available parallel corpus, so the pseudo-parallel corpus it generates in low-resource scenarios is of poor quality. To address this problem, this paper proposes a dictionary-based data augmentation method for low-resource NMT. The method first extracts a dictionary from the parallel corpus. It then selects suitable template sentences from the parallel corpus and from the target-language monolingual corpus and replaces words in them, generating a pseudo-parallel corpus that assists the training of the NMT model. Experiments on public datasets show that datasets augmented with the proposed method improve the baseline translation model by 3.71 to 6.42 BLEU.
Author: ZHANG Baoxing (张宝兴), School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
Source: Intelligent Computer and Applications (《智能计算机与应用》), 2024, No. 3, pp. 67-75 (9 pages)
Funding: National Natural Science Foundation of China (61772342).
Keywords: low-resource languages; neural machine translation; data augmentation
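
The word-replacement step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's template-selection criteria or replacement rules, so everything here is an assumption for illustration: the function name `augment_pair` is hypothetical, the dictionary is supplied by hand rather than extracted from a parallel corpus, and a sentence pair counts as a usable template only when a dictionary entry occurs exactly once on each side. Templates taken from the target-language monolingual corpus are not covered.

```python
def augment_pair(src, tgt, dictionary):
    """Generate pseudo-parallel pairs from one template sentence pair.

    dictionary maps a source word to its target translation. This is an
    illustrative assumption, not the paper's actual procedure.
    """
    src_tokens, tgt_tokens = src.split(), tgt.split()
    pseudo_pairs = []
    for s_old, t_old in dictionary.items():
        # Assumed template criterion: the entry must occur exactly once
        # on each side, so the swap is unambiguous.
        if src_tokens.count(s_old) != 1 or tgt_tokens.count(t_old) != 1:
            continue
        for s_new, t_new in dictionary.items():
            if s_new == s_old:
                continue
            # Swap the same dictionary entry on both sides of the pair.
            new_src = [s_new if w == s_old else w for w in src_tokens]
            new_tgt = [t_new if w == t_old else w for w in tgt_tokens]
            pseudo_pairs.append((" ".join(new_src), " ".join(new_tgt)))
    return pseudo_pairs


# Toy English-French dictionary, hand-written for the example.
toy_dictionary = {"cat": "chat", "dog": "chien"}
print(augment_pair("the cat sleeps", "le chat dort", toy_dictionary))
```

A naive swap like this ignores agreement and context (for example, grammatical gender on the target side), which is presumably one reason the method selects "suitable" template sentences rather than substituting into every sentence.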
