
Multi-strategy segmentation granularity for Tibetan-Chinese bidirectional neural machine translation (original title: 多策略切分粒度的藏汉双向神经机器翻译研究). Cited by: 7
Abstract: Existing machine translation models are usually trained on datasets segmented at word granularity. However, different segmentation granularities carry different grammatical and semantic features and information, so considering word granularity alone constrains the efficient training of neural machine translation (NMT) systems. The problem is particularly pronounced for Tibetan-related translation because of the linguistic characteristics of Tibetan. We therefore propose a multi-granularity training method for bidirectional Tibetan-Chinese machine translation that uses syllable, word, and syllable-word fusion granularities. Building on the existing attention-based NMT framework, we also propose a new NMT model that incorporates a self-attention mechanism into the decoder to capture more target-side information. Experimental results on the CWMT2018 Tibetan-Chinese bilingual dataset show that the multi-granularity training method clearly outperforms baseline systems trained with other segmentation granularities, and that the NMT model with self-attention in the decoder significantly improves translation quality. Additional experiments on the WMT2017 German-English bilingual dataset further demonstrate the applicability of the method to other language pairs.
Authors: SHA Jiu (沙九); FENG Chong (冯冲); ZHANG Tianfu (张天夫); GUO Yuhang (郭宇航); LIU Fang (刘芳)
Affiliations: Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications, School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China; Key Laboratory of Language Engineering and Cognitive Computing, Ministry of Industry and Information Technology, School of Foreign Languages, Beijing Institute of Technology, Beijing 100081, China
Source: Journal of Xiamen University (Natural Science) (《厦门大学学报(自然科学版)》), 2020, No. 2, pp. 213-219 (7 pages). Indexed in: CAS, CSCD, Peking University Core Journals.
Funding: National Key Research and Development Program of China (2016YFB0801200, 2018YFC0832104); National Natural Science Foundation of China (U1636203).
Keywords: syllable-word fusion; Tibetan-Chinese bidirectional; neural machine translation
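The abstract names three segmentation granularities for Tibetan (syllable, word, and syllable-word fusion) but does not spell out how each is produced. The following is a minimal sketch of one possible reading, not the authors' implementation. It relies on the fact that Tibetan syllables are delimited by the tsheg mark (U+0F0B). It assumes that word boundaries arrive as whitespace from some upstream segmenter, and it uses a hypothetical `@@` marker to encode word membership in the fused view. The function names are illustrative only.

```python
import re

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark
SHAD = "\u0f0d"   # Tibetan clause/sentence-final mark
_DELIMS = re.compile("[\u0f0b\\s]+")  # one or more tsheg or whitespace characters


def split_syllables(text):
    """Syllable granularity: split at tsheg/whitespace; a shad becomes its own token."""
    spaced = text.replace(SHAD, TSHEG + SHAD + TSHEG)
    return [tok for tok in _DELIMS.split(spaced) if tok]


def split_words(segmented):
    """Word granularity. Assumption: words are already whitespace-separated."""
    words = []
    for w in segmented.split():
        if w.endswith(SHAD) and len(w) > 1:
            # Detach the trailing shad so punctuation is a separate token.
            words.extend([w[:-1].rstrip(TSHEG), SHAD])
        else:
            words.append(w.rstrip(TSHEG))
    return [w for w in words if w]


def fuse(segmented, marker="@@"):
    """Hypothetical fused view: syllable tokens, where every non-final
    syllable of a word carries a marker so word boundaries stay recoverable."""
    fused = []
    for word in split_words(segmented):
        syls = split_syllables(word)
        fused.extend(s + marker for s in syls[:-1])
        fused.extend(syls[-1:])
    return fused


def multi_granularity(segmented):
    """Return all three views of one word-segmented sentence."""
    return {
        "syllable": split_syllables(segmented),
        "word": split_words(segmented),
        "fused": fuse(segmented),
    }
```

Under these assumptions, a multi-granularity training set could be assembled by calling `multi_granularity` on each source sentence and pairing every view with the same target sentence. Whether the paper mixes the views in one corpus or trains on them separately is not stated in the abstract.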