Abstract
Bilingual phrase pair extraction is a key step in training the phrase translation model for phrase-based statistical machine translation. Because Chinese-Uyghur parallel corpora are limited in size, however, data sparsity is a serious problem. This paper proposes an improved phrase extraction algorithm. It first handles the cases in the word alignment matrix where one Chinese word is aligned to multiple Uyghur words (including non-contiguous ones), then extracts phrase pairs with Och's method, and finally extracts further bilingual phrases by taking into account the SOV word order of Uyghur. Experiments show that the algorithm extracts Chinese-Uyghur phrase pairs fairly accurately while extracting as many pairs as possible, thereby improving the quality of the translation model.
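For orientation, the middle step (extracting phrase pairs that are consistent with a word alignment, in the style usually attributed to Och) can be sketched as follows. This is a minimal illustration of the standard consistency criterion only, not the paper's improved algorithm: it does not implement the one-to-many handling or the SOV-specific step. The function name `extract_phrase_pairs`, the `max_len` limit, and the placeholder tokens are assumptions made for the example.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Return the set of (source phrase, target phrase) pairs that are
    consistent with `alignment`, a set of (i, j) links meaning that
    src[i] is aligned to tgt[j]. `max_len` (an assumed parameter)
    bounds the phrase length on both sides."""
    aligned_tgt = {j for _, j in alignment}
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for i, j in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            # Consistency: no word inside the target span may be
            # aligned to a source word outside the source span.
            if any(j1 <= j <= j2 and not i1 <= i <= i2
                   for i, j in alignment):
                continue
            # Also emit variants that absorb unaligned target words
            # adjacent to the minimal target span.
            js = j1
            while js >= 0 and (js == j1 or js not in aligned_tgt):
                je = j2
                while je < len(tgt) and (je == j2 or je not in aligned_tgt):
                    if je - js + 1 <= max_len:
                        pairs.add((" ".join(src[i1:i2 + 1]),
                                   " ".join(tgt[js:je + 1])))
                    je += 1
                js -= 1
    return pairs


# Placeholder tokens (not real data): the last two source words
# appear in swapped order on the target side.
example = extract_phrase_pairs(
    ["c1", "c2", "c3"], ["u1", "u2", "u3"], {(0, 0), (1, 2), (2, 1)}
)
```

In this toy alignment the span `c1 c2` yields no pair, because its target span would cover `u2`, which is linked to `c3` outside the span. Reordering of this kind is one reason purely contiguous extraction can miss pairs.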
Source
《新疆大学学报(自然科学版)》
CAS
2010, No. 3, pp. 349-352 (4 pages)
Journal of Xinjiang University (Natural Science Edition)
Funding
National Natural Science Foundation of China (60663006, 60763006)
Keywords
statistical machine translation
phrase extraction
Chinese-Uyghur phrase pairs