
Research on Parallel Web Page Recognition Based on the Credibility of Bilingual URL Pairing Patterns (cited by: 3)

Detection of Parallel Web Pages Based on the Automatically Discovered Bilingual URL Pairing Patterns
Abstract: Parallel corpora are an important basic resource for natural language processing, and large volumes of them can be mined from bilingual parallel web pages. This paper first introduces a method for computing the credibility of automatically discovered bilingual URL pairing patterns (or keys), and then proposes an algorithm that recognizes bilingual parallel web pages based on local credibility. Building on the global credibility of the patterns, it extends the algorithm in two ways: rescuing weak keys that the initial algorithm filtered out for falling below the local credibility threshold, and unearthing deep bilingual web pages by combining global credibility with a web page detection method. Furthermore, it combines the bilingual credibility of web sites with their link relationships to detect more credible bilingual web sites around the seed web sites. Beyond automatic discovery of URL pairing patterns, it also uses search engines to recognize bilingual web pages quickly from a small set of high-credibility patterns. To improve the accuracy of the candidate bilingual page pairs produced by these five methods, the cross-lingual similarity of each candidate pair is computed, and pairs below a threshold are filtered out as non-parallel. Experiments confirm the effectiveness of the proposed methods.
Authors: ZHANG Chengzhi (章成志); MA Shutian (马舒天); KIT Chunyu (揭春雨); YAO Xuchen (姚旭晨) (Department of Information Management, Nanjing University of Science & Technology, Nanjing, Jiangsu 210094, China; Department of Linguistics and Translation, City University of Hong Kong, Hong Kong, China; Baidu Online Network Technology (Beijing) Co. Ltd., Beijing 100085, China)
Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2018, No. 3, pp. 91-100 (10 pages)
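Funding: City University of Hong Kong SRG-Fd project (7008003); Hong Kong Research Grants Council GRF projects (CityU 144410 11600415); National Natural Science Foundation of China (70903032)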
Keywords: parallel webpage mining; parallel corpora; bilingual URL pairing pattern; bilingual text mining
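
The abstract describes pairing URLs through substitution patterns (keys) and keeping only keys whose local credibility clears a threshold. The following is a minimal Python sketch of that idea, not the paper's implementation: the abstract gives no formula, so the credibility score here (the share of a site's distinct URLs covered by a key's pairs) is an assumption, and `CANDIDATE_KEYS`, `match_pairs`, `local_credibility`, `detect_parallel_pages`, and the default threshold are hypothetical names and values chosen for illustration. The paper discovers keys automatically, whereas this sketch takes a fixed candidate list.

```python
# Hypothetical candidate keys (language-marker substitutions); not from the paper.
CANDIDATE_KEYS = [("/en/", "/zh/"), ("_e.", "_c."), ("lang=en", "lang=zh")]


def match_pairs(urls, key):
    """Return pairs (u, v) where v is u with the first occurrence of key[0] replaced by key[1]."""
    src, dst = key
    url_set = set(urls)
    pairs = []
    for u in sorted(url_set):
        if src in u:
            v = u.replace(src, dst, 1)
            if v != u and v in url_set:
                pairs.append((u, v))
    return pairs


def local_credibility(urls, key):
    """Assumed score: fraction of the site's distinct URLs that take part in a pair for this key."""
    url_set = set(urls)
    if not url_set:
        return 0.0
    covered = set()
    for u, v in match_pairs(urls, key):
        covered.add(u)
        covered.add(v)
    return len(covered) / len(url_set)


def detect_parallel_pages(urls, keys=CANDIDATE_KEYS, threshold=0.2):
    """Keep the pairs of every key whose assumed local credibility meets the threshold."""
    result = {}
    for key in keys:
        score = local_credibility(urls, key)
        if score >= threshold:
            result[key] = (score, match_pairs(urls, key))
    return result


if __name__ == "__main__":
    site = [
        "http://a.example/en/p1.html",
        "http://a.example/zh/p1.html",
        "http://a.example/en/p2.html",
        "http://a.example/zh/p2.html",
        "http://a.example/about.html",
    ]
    for key, (score, pairs) in detect_parallel_pages(site).items():
        print(key, round(score, 2), len(pairs))
```

Candidate pairs produced this way would still need the cross-lingual similarity filter mentioned in the abstract, which this sketch leaves out.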
