Abstract
Neural Mongolian-Chinese machine translation follows a strictly sequential encoder-decoder design and therefore cannot make effective use of syntactic information or the hierarchical structure of language. To incorporate syntactic structure and improve translation quality, this paper proposes a dual encoder on the source-language side that encodes both the source sentence and the syntactic dependency tree parsed from it. Because out-of-vocabulary words occur frequently in Mongolian-Chinese machine translation, the Mongolian text is preprocessed with byte pair encoding (BPE). To address the over-correction problem in machine translation, during training the model samples context words, with a certain probability, from both the ground-truth sequence and the sequence it has predicted. Experiments on a Mongolian-Chinese parallel corpus of 1.2 million entries show that the proposed method improves BLEU by 2.69 and 2.09 over the traditional BiRNN and CNN models, respectively.
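The sampling step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `choose_context`, the parameter `p_gold`, and the per-position sampling granularity are assumptions made only for illustration, and the abstract does not state how the probability is chosen or scheduled.

```python
import random


def choose_context(gold_tokens, predicted_tokens, p_gold, rng=random):
    """Pick the decoder's context word at each position.

    Hypothetical helper: with probability `p_gold` the ground-truth
    token is kept, otherwise the model's own prediction is used.
    """
    if len(gold_tokens) != len(predicted_tokens):
        raise ValueError("sequences must have the same length")
    if not 0.0 <= p_gold <= 1.0:
        raise ValueError("p_gold must lie in [0, 1]")
    # rng.random() returns a float in [0.0, 1.0), so p_gold = 1.0 always
    # keeps the gold token and p_gold = 0.0 always keeps the prediction.
    return [
        gold if rng.random() < p_gold else pred
        for gold, pred in zip(gold_tokens, predicted_tokens)
    ]
```

With `p_gold = 1.0` this reduces to ordinary teacher forcing, and with `p_gold = 0.0` the decoder is conditioned only on its own predictions. Values in between mix the two sources.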
Authors
薛媛
苏依拉
仁庆道尔吉
石宝
李雷孝
Xue Yuan; Su Yila; Ren Qingdaoerji; Shi Bao; Li Leixiao (College of Information Engineering, Inner Mongolia University of Technology, Hohhot 010080, Inner Mongolia, China)
Source
《计算机应用与软件》
Peking University Core Journal
2023, No. 10, pp. 70-75, 89 (7 pages in total)
Computer Applications and Software
Funding
National Natural Science Foundation of China (61966028, 61966027).
Keywords
Dependency syntax tree
Graph convolutional encoding
Byte pair encoding
Mongolian-Chinese machine translation