Abstract
Building bilingual parallel corpora is an effective way to improve machine translation quality for low-resource languages. This paper proposes a Chinese-Burmese parallel sentence pair extraction method based on a CNN-CorrNet network. Specifically, BERT is first used to obtain word vector representations for Chinese and Burmese, and a convolutional neural network (CNN) then encodes the sentences of both languages to capture their important features. Next, to maximize the correlation between the cross-lingual representations, existing Chinese-Burmese parallel sentence pairs are used as constraints, and CorrNet (correlational neural network) projects the Chinese and Burmese sentence representations into a common semantic space. Finally, the distance between Chinese and Burmese sentences in the common semantic space is computed and used to decide whether a sentence pair is parallel. Experimental results show that, compared with a maximum entropy model and a Siamese network model, the proposed method improves the F1 score by 13.3% and 5.1%, respectively.
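The pipeline described in the abstract (convolutional sentence encoding, a correlation objective over known parallel pairs, and a distance-based decision) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the ReLU activation, the use of cosine distance, and the threshold value are all assumptions made for illustration; the abstract does not specify them. The word vectors are assumed to come from BERT and to be already available as arrays.

```python
import numpy as np

def conv_max_pool(word_vecs, filters):
    """Encode a sentence: 1-D convolution over word windows, ReLU, max-over-time pooling.

    word_vecs: (n_words, dim) array of word vectors (assumed to come from BERT).
    filters:   (n_filters, width, dim) array of convolution filters.
    Assumes n_words >= width.
    """
    n_filters, width, dim = filters.shape
    n_windows = word_vecs.shape[0] - width + 1
    # Stack all sliding windows: (n_windows, width, dim)
    windows = np.stack([word_vecs[i:i + width] for i in range(n_windows)])
    # Contract each window with each filter: (n_windows, n_filters)
    feats = np.tensordot(windows, filters, axes=([1, 2], [1, 2]))
    return np.maximum(feats, 0.0).max(axis=0)

def correlation(h_x, h_y, eps=1e-8):
    """Per-dimension Pearson correlation between two batches, summed over dimensions.

    h_x, h_y: (batch, dim) projected representations of known parallel pairs.
    A CorrNet-style objective would try to make this quantity large.
    """
    x = h_x - h_x.mean(axis=0)
    y = h_y - h_y.mean(axis=0)
    num = (x * y).sum(axis=0)
    den = np.sqrt((x ** 2).sum(axis=0) * (y ** 2).sum(axis=0)) + eps
    return float((num / den).sum())

def cosine_distance(u, v, eps=1e-8):
    """Cosine distance in the common space (the distance measure is an assumption)."""
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def is_parallel(u, v, threshold=0.5):
    """Label a pair as parallel when its distance is below a hypothetical threshold."""
    return cosine_distance(u, v) <= threshold
```

In practice the filters and the projection into the common space would be learned jointly, and the threshold would be tuned on held-out parallel pairs; the sketch only shows how the three stages fit together.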
Authors
毛存礼
吴霞
朱俊国
余正涛
李云龙
王振晗
MAO Cunli; WU Xia; ZHU Junguo; YU Zhengtao; LI Yunlong; WANG Zhenhan (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, Yunnan 650500, China)
Source
《中文信息学报》
CSCD
Peking University Core Journals (北大核心)
2020, Issue 11, pp. 60-66 (7 pages)
Journal of Chinese Information Processing
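Funding
National Natural Science Foundation of China (61732005, 61662041, 61761026, 61866019, 61972186)
Yunnan Provincial Applied Basic Research Program, Key Project (2019FA023)
Yunnan Provincial Reserve Talents Program for Young and Middle-aged Academic and Technical Leaders (2019HB006)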
Keywords
Chinese-Burmese bilingual
parallel sentence pair
CNN
correlational neural networks
common semantic space