

Design and Implementation of Online Building System for Chinese-English News Comparable Corpora
Abstract Comparable corpora are an important basic language resource. Mining comparable texts online from the web is an effective way to build large-scale comparable corpora, and suitable source websites together with an effective comparability measure simplify the mining process. With the English edition of the Global Times (globaltimes.cn) and ifeng.com chosen as the English and Chinese news sources respectively, an online building system for Chinese-English news comparable corpora is designed. Test results indicate that the system can generate comparable news pairs continuously and stably.
Authors ZHAO Yongbiao (赵永标); ZHANG Qilin (张其林); GU Qiong (谷琼) (School of Computer Engineering, Hubei University of Science and Arts, Xiangyang 441053, Hubei, China)
Source Journal of Anshun University (《安顺学院学报》), 2019, No. 3, pp. 121-124 (4 pages)
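Funding State Language Commission 13th Five-Year Plan research project "Research on Online Mining of Web Comparable Corpora Based on Topic Models" (Project No. YB135-22)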
Keywords bilingual corpora; comparable corpora; comparability; news








