Abstract
Unsupervised methods, which strive to alleviate the impact of the scarcity of large parallel corpora on machine translation quality, have attracted much attention in the field of neural machine translation. However, their translation performance on distant language pairs still needs to be improved. Therefore, the translation language model (TLM) is introduced and the Dict-TLM method is proposed. The core idea of this method is to train language models by combining monolingual corpora with unsupervised bilingual dictionaries. Specifically, the model first accepts source-language sentences as input; then, unlike the traditional TLM, which accepts only parallel corpora, the Dict-TLM model also accepts as input source-language sentences processed by an unsupervised bilingual dictionary. In this input, the words of the source sentence that appear in the bilingual dictionary are replaced with their corresponding target-language translations. Importantly, the bilingual dictionary used in this method is obtained in an unsupervised manner. Experiments show that Dict-TLM improves on traditional unsupervised machine translation by 3 BLEU points on the Chinese-English language pair.
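The preprocessing step the abstract describes, replacing source-language words that appear in the bilingual dictionary with their target-language translations, can be sketched as follows. This is an illustrative reconstruction, not the authors' actual implementation; the function name, the toy dictionary entries, and the tokenized input are all assumptions.

```python
# Illustrative sketch of the Dict-TLM input preparation described in the
# abstract: each source token found in an (unsupervised) bilingual
# dictionary is replaced by its target-language translation.

def code_switch(tokens, bilingual_dict):
    """Replace every source token that has an entry in the bilingual
    dictionary with its target-language translation; keep the rest."""
    return [bilingual_dict.get(tok, tok) for tok in tokens]

# Toy Chinese-English dictionary (hypothetical entries for illustration;
# in the paper this dictionary is induced without supervision).
zh_en_dict = {"猫": "cat", "坐在": "sits on", "垫子": "mat"}

source = ["猫", "坐在", "垫子", "上"]
mixed = code_switch(source, zh_en_dict)
print(mixed)  # tokens without a dictionary entry ("上") are kept as-is
```

The resulting code-switched sequence would then serve as additional training input for the language model, alongside the original monolingual sentences.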
Author
HUANG Mengqin (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China)
Source
《现代电子技术》 (Modern Electronics Technique), Peking University Core Journal (北大核心), 2024, Issue 7, pp. 161-164 (4 pages)
Keywords
unsupervised neural machine translation
distant language pairs
pre-training
TLM
bilingual dictionary
bilingual word embedding