Automatic Acquisition of Domain Parallel Corpora from Internet (cited by: 2)
Abstract [Objective] To classify acquired bilingual corpora and perform sentence alignment on the classified corpora, producing domain parallel corpora. [Methods] An SVM-based text classifier is used to classify the acquired Chinese-English bilingual corpora. A sentence alignment tool that combines the length-based and lexicon-based methods then aligns the classified corpora. To improve alignment accuracy, Chinese-English sentence length parameters are computed from manually aligned Chinese-English parallel corpora and combined with a Chinese-English bilingual dictionary, yielding high-quality parallel corpora for specialized domains. [Results] After sentence alignment of the corpus in each domain, the method achieves a sentence alignment accuracy of 95.45%. The computed mean of the sentence length ratio is 1.7777, with a variance of 1.2640. [Limitations] Because the bilingual corpora were already fairly well aligned initially, the reported alignment accuracy may not be generally representative. [Conclusions] The experimental results indicate that the method is effective and can obtain domain parallel corpora of satisfactory quality.
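The length-parameter step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice to measure both languages in characters, the English-over-Chinese direction of the ratio, and the toy sentence pairs are all assumptions made for illustration. The scoring function follows the normalized length difference used in Gale and Church's length-based aligner, which is one common way such parameters are applied.

```python
import math
from statistics import mean, pvariance


def length_ratio_params(pairs):
    """Estimate (mean, variance) of the English/Chinese length ratio.

    `pairs` is a list of (chinese_sentence, english_sentence) tuples that
    were aligned by hand. Lengths are counted in characters (an assumption).
    """
    ratios = [len(en) / len(zh) for zh, en in pairs if len(zh) > 0]
    return mean(ratios), pvariance(ratios)


def length_score(len_zh, len_en, c, s2):
    """Gale-Church style normalized length difference.

    c is the mean length ratio and s2 its variance; values near 0 mean
    the two lengths are consistent with being mutual translations.
    """
    if len_zh == 0:
        return float("inf")
    return (len_en - len_zh * c) / math.sqrt(len_zh * s2)


# Hypothetical manually aligned pairs, used only to exercise the functions.
toy_pairs = [("我喜欢猫", "I like cats"), ("你好", "Hello")]
c, s2 = length_ratio_params(toy_pairs)
```

In a full aligner, a score like this would be combined with dictionary evidence when choosing among candidate sentence pairings; how the paper weights the two signals is not stated in the abstract.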
Authors Shao Jian (邵健), Zhang Chengzhi (章成志)
Source New Technology of Library and Information Service (《现代图书情报技术》), indexed in CSSCI and the Peking University Core Journals list, 2014, No. 12, pp. 36-43 (8 pages)
Keywords Sentence alignment; Text classification; Parallel corpora; Machine translation