
Mining Parallel Corpora from Web and Its Application in Machine Translation (cited by 5)

Original title: Web平行语料挖掘及其在机器翻译中的应用
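Abstract (translated from the Chinese) Bilingual parallel corpora have many important applications in natural language processing, but acquiring large-scale bilingual parallel corpora automatically is not easy. This paper proposes an effective scheme for obtaining high-quality bilingual parallel corpora from the Web and studies key techniques such as acquiring candidate bilingual mixed-language web pages and extracting parallel sentence pairs. With this method, 2.58 million bilingual parallel sentence pairs were obtained with an average accuracy of 93.75%, and the first 1.5 million sentence pairs reach an average accuracy of 96%. The paper also proposes two methods, sentence-pair quality ranking and domain information retrieval, for applying Web data to the training of statistical machine translation models; on IWSLT evaluation data, BLEU improves by 2 to 5 percentage points.

Abstract (English, as published) Bilingual parallel corpora can be used in many applications of NLP, but it is not easy to acquire large-scale corpora automatically. This paper proposes an effective solution to mine high-quality bilingual parallel corpora from web pages and analyzes the key techniques of obtaining candidate mixed-language web pages and sentence alignment. We have extracted 1.67 million parallel sentences, with an average accuracy of 93.75%, and the accuracy of the first 1 million sentences is 96%. This paper also proposes a sentence re-ranking method and a domain information retrieval method to apply the web data to the training of SMT models. Experiments conducted on the IWSLT tasks show gains of 2 to 5 BLEU points over the baseline.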
Source Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2010, No. 5, pp. 85-91 (7 pages)
Funding National Natural Science Foundation of China (Grant No. 60603095)
Keywords Web mining; parallel corpora; sentence alignment; statistical machine translation


