Abstract
Bilingual data selection aims to automatically extract, from a large-scale general-domain bilingual corpus, the sentence pairs most relevant to the domain of the text to be translated, so as to alleviate the shortage of training data for domain-specific statistical machine translation. In contrast to earlier selection methods based solely on language models, we approach the problem from the perspective of generative modeling of sentence pairs and propose a selection method that combines translation models with language models. The method can effectively assess both the domain relevance of a sentence pair and how well its two sides translate each other. Experiments show that a translation system trained on sentence pairs selected with our method outperforms the baseline system trained on the general-domain corpus by 3.5 BLEU points on the test set. In addition, to address the problem of tuning the weights of the different features used to evaluate sentence-pair quality, we present an automatic weight optimization method based on sentence-pair re-ranking. A machine translation system built with this method achieves a further improvement of 0.68 BLEU points.
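The abstract does not give the scoring formula, so the following is only a minimal sketch of how a language model and a translation model might be combined to rank sentence pairs. It assumes the generative factorization P(s, t) = P(s) · P(t | s), a smoothed unigram language model trained on in-domain source text, and an IBM Model 1 style lexical table. All function names, the `<null>` token, the probability floor, and the weights `w_lm` and `w_tm` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def train_unigram_lm(sentences, alpha=1.0):
    """Train an add-alpha smoothed unigram LM on whitespace-tokenized sentences.

    Returns a function mapping a word to its log-probability.
    """
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words

    def logprob(word):
        return math.log((counts[word] + alpha) / (total + alpha * vocab))

    return logprob


def lm_score(sentence, logprob):
    """Length-normalized log-probability of a sentence (domain relevance proxy)."""
    words = sentence.split()
    return sum(logprob(w) for w in words) / max(len(words), 1)


def tm_score(src, tgt, lex, floor=1e-6):
    """Length-normalized log P(tgt | src) in the style of IBM Model 1.

    lex maps (source_word, target_word) to a lexical translation probability;
    unseen word pairs fall back to `floor` (an assumed constant).
    """
    src_words = src.split() + ["<null>"]
    tgt_words = tgt.split()
    total = 0.0
    for t in tgt_words:
        p = sum(lex.get((s, t), floor) for s in src_words) / len(src_words)
        total += math.log(p)
    return total / max(len(tgt_words), 1)


def pair_score(src, tgt, src_lm, lex, w_lm=1.0, w_tm=1.0):
    """Weighted sum of the LM score (relevance) and the TM score (translation quality)."""
    return w_lm * lm_score(src, src_lm) + w_tm * tm_score(src, tgt, lex)


def select(pairs, src_lm, lex, k):
    """Return the k highest-scoring (src, tgt) pairs from a general-domain pool."""
    ranked = sorted(pairs, key=lambda p: pair_score(p[0], p[1], src_lm, lex), reverse=True)
    return ranked[:k]
```

In this sketch the weights are fixed by hand. The re-ranking based weight optimization mentioned in the abstract is not reproduced here; a real system would also use stronger models (for example n-gram language models and a trained alignment model) in place of these toy components.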
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journals (北大核心)
2016, No. 5, pp. 145-152 (8 pages)
Funding
National Natural Science Foundation of China (61373097, 61272259, 61272260)
Keywords
bilingual data selection
generative modeling
translation model
language model
weight tuning