Selection of Parallel Corpus Based on Classification (一种基于分类的平行语料选择方法)

Cited by: 4

Abstract: A large-scale, high-quality bilingual parallel corpus is a fundamental resource for building a high-quality statistical machine translation system. However, such corpora usually contain a large amount of noise, which degrades the performance of the translation system, so the noisy sentence pairs need to be filtered out. Unlike traditional ranking models for corpus selection, this paper proposes a classification-based selection approach that distinguishes high-quality bilingual sentence pairs from noisy ones. We first use a small number of sentence-pair features to find the best and worst sentence pairs in the corpus, which serve as training data for a classifier. We then train the classifier on these pairs with a larger set of features and use it to classify the remaining sentence pairs. Compared with the baseline system trained on all sentence pairs, our approach reduces the size of the training corpus by 40% and improves translation performance by 0.87 BLEU points on the NIST test sets.
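
The two-stage procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the features or the classifier, so the nearest-centroid classifier, the function names (`pick_seeds`, `select_pairs`), and the use of precomputed feature vectors are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def pick_seeds(seed_scores: Sequence[float], k: int) -> Tuple[List[int], List[int]]:
    """Stage 1: return indices of the k highest- and k lowest-scoring pairs.

    seed_scores is a coarse quality score built from a few features
    (hypothetical here; the abstract does not specify them).
    """
    order = sorted(range(len(seed_scores)), key=lambda i: seed_scores[i])
    worst = order[:k]   # lowest scores: assumed noisy
    best = order[-k:]   # highest scores: assumed high quality
    return best, worst


def centroid(rows: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_pairs(seed_scores: Sequence[float],
                 full_features: Sequence[Vector],
                 k: int) -> List[int]:
    """Stage 2: train on the seeds with the richer features, classify the rest.

    A nearest-centroid classifier stands in for whatever classifier the
    paper actually uses (an assumption). Returns sorted indices of the
    sentence pairs kept as high quality.
    """
    if k < 1 or 2 * k > len(seed_scores):
        raise ValueError("k must be >= 1 and leave disjoint best/worst seeds")
    best, worst = pick_seeds(seed_scores, k)
    good_c = centroid([full_features[i] for i in best])
    bad_c = centroid([full_features[i] for i in worst])
    seeds = set(best) | set(worst)
    kept = set(best)
    for i, vec in enumerate(full_features):
        if i in seeds:
            continue
        # Keep a pair if it lies closer to the high-quality centroid.
        if sq_dist(vec, good_c) < sq_dist(vec, bad_c):
            kept.add(i)
    return sorted(kept)
```

In this sketch a sentence pair with a middling seed score can still be kept if its richer feature vector resembles the high-quality seeds, which is the point of training the classifier with more features than were used to pick the seeds.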

Authors: Wang Xing (王星), Tu Zhaopeng (涂兆鹏), Xie Jun (谢军), Lü Yajuan (吕雅娟), Yao Jianmin (姚建民)

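Affiliations: School of Computer Science and Technology, Soochow University; Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; Department of Computer Science, University of California, Davis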

Source: Journal of Chinese Information Processing (《中文信息学报》; indexed in CSCD and the PKU Core Journals list), 2013, No. 6, pp. 144-150 (7 pages)

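Funding: National High-Tech R&D Program of China (863 Program), major project (No. 2011AA01A207); National Natural Science Foundation of China (Nos. 61003152, 61272259)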

Keywords: statistical machine translation; bilingual corpus; selection

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
