HITS-Based Optimization Method for Bilingual Corpus Mining (cited by: 5)
Abstract: Identifying and locating domain-specific bilingual websites is the key to automatically building domain-specific bilingual corpora from the Web. However, the quality of sentence pairs often varies considerably across such websites. Existing methods filter out low-quality sentence pairs using textual features of the pairs themselves. This paper instead starts from the origin of the sentence pairs, namely the domain-specific bilingual websites. Based on the hypothesis that websites with high domain authority tend to contain high-quality parallel sentence pairs, we propose a HITS-based optimization method for mining bilingual sentence pairs. The method first builds a directed graph model from the link information among websites, then uses the HITS algorithm to measure the authority of each website, and finally extracts bilingual sentence pairs only from highly authoritative websites to train a domain-specific machine translation system. Taking the education domain as the target, we test the feasibility of the hypothesis that websites with high domain authority contain high-quality sentence pairs. Experimental results show that a translation system trained on sentence pairs mined with the proposed method improves on the baseline system by 0.44 BLEU points on average. In addition, to address the "topic drift" problem of the HITS algorithm, we propose an improved algorithm based on GHITS. A machine translation system improved with the GHITS algorithm gains a further 0.40 BLEU points.
Source: Journal of Chinese Information Processing (《中文信息学报》; CSCD, Peking University Core Journal), 2017, No. 2, pp. 25-35 (11 pages)
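Funding: National Natural Science Foundation of China (61373097, 61272259, 61272260, 90920004); Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (2009321110006, 20103201110021); Natural Science Foundation of Jiangsu Province (BK2011282); Major Project of the Natural Science Foundation of Jiangsu Higher Education Institutions (11KJA520003); Natural Science Foundation of Suzhou (SH201212)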
Keywords: statistical machine translation; specific-domain machine translation; specific-domain bilingual websites; authority; HITS
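
The pipeline described in the abstract (build a directed link graph among websites, score authority with HITS, keep only the most authoritative sites) can be illustrated with a minimal sketch. This is the textbook HITS power iteration, not the authors' implementation. The site names, the link list, the iteration count, and the helper `top_authorities` are hypothetical and chosen only for illustration. The GHITS variant is not covered, since the abstract does not describe it in detail.

```python
import math


def hits(edges, iterations=50):
    """Textbook HITS power iteration over a directed graph.

    edges: iterable of (source, target) pairs, meaning source links to target.
    Returns (authority, hub) dicts whose scores are L2-normalised.
    """
    nodes = sorted({n for edge in edges for n in edge})
    incoming = {n: [] for n in nodes}
    outgoing = {n: [] for n in nodes}
    for src, dst in edges:
        outgoing[src].append(dst)
        incoming[dst].append(src)

    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of sites linking in.
        auth = {n: sum(hub[m] for m in incoming[n]) for n in nodes}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub update: sum of the authority scores of sites linked to.
        hub = {n: sum(auth[m] for m in outgoing[n]) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return auth, hub


def top_authorities(edges, k):
    """Hypothetical helper: return the k sites with the highest authority."""
    auth, _ = hits(edges)
    return sorted(auth, key=lambda n: (-auth[n], n))[:k]


# Hypothetical link graph among candidate bilingual websites.
links = [
    ("portal.example", "univ-a.example"),
    ("portal.example", "univ-b.example"),
    ("blog.example", "univ-a.example"),
    ("blog.example", "univ-b.example"),
    ("forum.example", "univ-a.example"),
]

# Sentence pairs would then be extracted only from these sites.
selected = top_authorities(links, k=2)
```

In this toy graph the two sites that receive the most inbound links from hub-like pages end up with the highest authority, so only they would feed the sentence-pair extraction step. How many sites to keep (here `k=2`) is an assumption; the abstract does not state the cutoff the authors used.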
