Abstract
Low-frequency words are a key factor limiting the performance of neural machine translation models. Because they occur only rarely in the training data, training often fails to learn accurate representations for them, and the problem is more pronounced in low-resource translation. This paper proposes a low-resource neural machine translation method that enhances the representation of low-frequency words. The main idea is to use context information from monolingual data to learn the probability distribution of low-frequency words, recompute the word embeddings of low-frequency words from this distribution, and then retrain the Transformer model on the resulting embeddings, which alleviates the problem of inaccurate low-frequency word representations. Experiments on two language pairs, Chinese-Vietnamese and Chinese-Mongolian, in all four translation directions show that the proposed method significantly improves over the baseline model in every direction.
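As a rough illustration of the re-estimation step described in the abstract, the sketch below recomputes a low-frequency word's embedding as a probability-weighted average of other words' embeddings. The abstract does not give the formula, so the weighted-average rule, the function name `reestimate_embedding`, and the toy data are all assumptions made here for illustration; in the paper the distribution is learned from monolingual context rather than written by hand.

```python
# Hypothetical sketch: the weighting rule and all names are assumptions,
# not the formulation used in the paper.

def reestimate_embedding(distribution, embeddings):
    """Return the probability-weighted average of known word embeddings.

    distribution: dict mapping a vocabulary word to its (possibly
                  unnormalized) weight for the low-frequency word.
    embeddings:   dict mapping a vocabulary word to its embedding,
                  given as a list of floats.
    """
    # Only words that actually have an embedding can contribute.
    words = [w for w in distribution if w in embeddings]
    total = sum(distribution[w] for w in words)
    if total <= 0:
        raise ValueError("distribution has no mass on known words")

    dim = len(embeddings[words[0]])
    result = [0.0] * dim
    for w in words:
        p = distribution[w] / total  # renormalize over known words
        for i, x in enumerate(embeddings[w]):
            result[i] += p * x
    return result


# Toy data (made up): two frequent words and a hand-written distribution.
embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
distribution = {"cat": 0.75, "dog": 0.25}
new_vec = reestimate_embedding(distribution, embeddings)
```

In a full pipeline, vectors like `new_vec` would replace the low-frequency rows of the embedding table before the Transformer is retrained; how the paper handles that step in detail is not stated in the abstract.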
Authors
朱俊国
杨福岸
余正涛
邹翔
张泽锋
ZHU Junguo; YANG Fuan; YU Zhengtao; ZOU Xiang; ZHANG Zefeng (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, Yunnan 650500, China)
Source
《中文信息学报》
CSCD
PKU Core Journal
2022, Issue 6, pp. 44-51 (8 pages)
Journal of Chinese Information Processing
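Funding
National Natural Science Foundation of China (61732005, 62166022, 61866020)
Yunnan Provincial Department of Science and Technology General Program (202101AT076077)
Yunnan Province Talent Training Project (KKSY201903018)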
Keywords
low-frequency word representation
information enhancement
low-resource
neural machine translation