Selection of Parallel Corpus Based on Classification

Cited by: 4
Abstract: A large-scale, high-quality bilingual parallel corpus is a fundamental resource for building a high-quality statistical machine translation system. However, such corpora usually contain a large amount of noise, which degrades translation performance, so it is necessary to filter the sentence pairs in a large corpus. Unlike traditional ranking-based corpus selection models, this paper proposes a classification-based approach to distinguish high-quality bilingual sentence pairs from noisy ones. We first use a small number of sentence-pair features to find the best and worst sentence pairs in the corpus, which differ strongly in quality and serve as training examples. We then train a classifier on these examples with a larger set of features and use it to classify the remaining sentence pairs. Compared with the baseline system that uses all sentences, our approach reduces the training corpus by 40% and also improves translation performance by 0.87 BLEU points on the NIST test sets.
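The two-stage procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the features or the classifier, so the length-ratio seed feature, the toy lexicon-coverage feature, the logistic-regression classifier, and every function name and parameter below (`seed_score`, `rich_features`, `select_corpus`, `seed_fraction`) are assumptions chosen only to show the flow.

```python
import math

def seed_score(src, tgt):
    """Cheap seed feature (an assumption): token length ratio in [0, 1]."""
    ls, lt = len(src.split()), len(tgt.split())
    if ls == 0 or lt == 0:
        return 0.0
    return min(ls, lt) / max(ls, lt)

def rich_features(src, tgt, lexicon):
    """Richer feature vector (hypothetical): bias, length ratio, lexicon coverage."""
    src_words = src.split()
    tgt_words = set(tgt.split())
    covered = sum(1 for w in src_words if lexicon.get(w) in tgt_words)
    coverage = covered / len(src_words) if src_words else 0.0
    return [1.0, seed_score(src, tgt), coverage]

def predict(weights, x):
    """Probability that a sentence pair is high quality under a logistic model."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, epochs=200, lr=0.5):
    """Plain stochastic gradient ascent for logistic regression."""
    weights = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, label in zip(X, y):
            error = label - predict(weights, x)
            for i, xi in enumerate(x):
                weights[i] += lr * error * xi
    return weights

def select_corpus(pairs, lexicon, seed_fraction=0.25):
    """Stage 1: rank by the seed feature and take the extremes as training data.
    Stage 2: train on the extremes with richer features, then classify the rest."""
    ranked = sorted(pairs, key=lambda p: seed_score(*p))
    k = max(1, int(len(ranked) * seed_fraction))
    worst, best, rest = ranked[:k], ranked[-k:], ranked[k:-k]
    X = [rich_features(s, t, lexicon) for s, t in worst + best]
    y = [0] * len(worst) + [1] * len(best)
    weights = train_logistic(X, y)
    kept = list(best)
    kept += [p for p in rest if predict(weights, rich_features(*p, lexicon)) >= 0.5]
    return kept

# Toy data, invented purely for illustration.
toy_lexicon = {"wo": "i", "ai": "love", "ni": "you", "ta": "he", "lai": "came"}
toy_pairs = [
    ("wo ai ni", "i love you"),
    ("ta lai le", "he came"),
    ("hao", "this is a completely unrelated long sentence"),
    ("ni hao ma", "how are you"),
    ("xie xie", "thank you very much indeed for everything today"),
    ("zai jian", "goodbye"),
    ("wo men zou ba", "let us go"),
    ("tian qi hen hao", "the weather is nice"),
]
selected = select_corpus(toy_pairs, toy_lexicon)
print(len(selected), "of", len(toy_pairs), "sentence pairs kept")
```

In this sketch the pairs used as negative training examples are always discarded and the positive ones are always kept; only the middle of the ranking is decided by the classifier. A real system would presumably use translation-model and language-model features rather than these toy ones.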
Source: Journal of Chinese Information Processing (《中文信息学报》; indexed in CSCD and the Peking University Core Journals list), 2013, No. 6, pp. 144-150 (7 pages)
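Funding: National High-Tech R&D Program of China (863 Program) major project (No. 2011AA01A207); National Natural Science Foundation of China (Nos. 61003152, 61272259)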
Keywords: statistical machine translation; parallel (bilingual) corpus selection