
Combining Translation and Language Models for Bilingual Data Selection (cited by: 2)
Abstract Bilingual data selection aims to automatically extract, from a large general-domain bilingual corpus, the sentence pairs most relevant to the domain of the text to be translated, so as to alleviate the shortage of training data for domain-specific translation models. In contrast to earlier selection methods that rely on language models alone, this paper takes a generative view of sentence pairs and proposes a selection method that combines translation models with language models. The method evaluates both the domain relevance of a sentence pair and the degree to which its two sides are translations of each other. Experiments show that a translation system trained on the sentence pairs selected by this method outperforms the baseline system, trained on the general-domain corpus, by 3.5 BLEU points on the test set. In addition, to address the problem of weighting the different features used to assess sentence-pair quality, the paper proposes an automatic feature-weight optimization method based on sentence-pair re-ranking. A machine translation system built with this method gains a further 0.68 BLEU points.
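The abstract does not give the scoring formulas, so the following is only a minimal sketch of how a language-model score and a translation-model score might be combined to rank sentence pairs. It assumes the generative factorization P(s, t) = P(s) · P(t | s), an add-one smoothed unigram language model trained on in-domain source text, and an IBM Model 1 style lexical translation score. All function names, the `lex_table` dictionary, and the fixed weights `w_lm` and `w_tm` are hypothetical illustrations, not the authors' implementation.

```python
import math
from collections import Counter


def train_unigram_lm(sentences):
    """Train an add-one smoothed unigram LM (an assumed stand-in for the
    paper's language model); returns a length-normalised log-prob function."""
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen tokens

    def logprob(sentence):
        lp = sum(math.log((counts[tok] + 1) / (total + vocab)) for tok in sentence)
        return lp / max(len(sentence), 1)

    return logprob


def model1_logprob(source, target, lex_table, floor=1e-6):
    """IBM Model 1 style log P(target | source), length-normalised.
    lex_table maps (source_word, target_word) to a lexical probability."""
    src = ["<null>"] + list(source)
    lp = 0.0
    for t in target:
        # Average the lexical probability of t over all source positions.
        avg = sum(lex_table.get((s, t), floor) for s in src) / len(src)
        lp += math.log(avg)
    return lp / max(len(target), 1)


def score_pair(source, target, src_lm, lex_table, w_lm=0.5, w_tm=0.5):
    """Weighted sum of the domain-relevance (LM) and parallelism (TM) scores."""
    return w_lm * src_lm(source) + w_tm * model1_logprob(source, target, lex_table)


def select_top_k(pairs, src_lm, lex_table, k):
    """Return the k highest-scoring (source, target) pairs."""
    ranked = sorted(
        pairs,
        key=lambda p: score_pair(p[0], p[1], src_lm, lex_table),
        reverse=True,
    )
    return ranked[:k]
```

In this sketch the weights are fixed by hand. The abstract describes tuning such feature weights automatically through sentence-pair re-ranking, which is not reproduced here.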
Source Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2016, No. 5, pp. 145-152 (8 pages)
Funding National Natural Science Foundation of China (61373097, 61272259, 61272260)
Keywords bilingual data selection; generative modeling; translation model; language model; weight tuning
