Uyghur-Chinese Sentence Alignment Based on Multi-Feature Fusion and Graph Matching (基于多特征融合和图匹配的维汉句子对齐) Cited by: 2

Uyghur Chinese Sentence Alignment Based on Multi Features and Optimal Matching
Abstract Uyghur news web pages and their Chinese translations are often only partially comparable: the bilingual sentence sequences may be out of order, and some sentences may be missing, which makes Uyghur-Chinese sentence alignment difficult. Many person and location names, which are core elements of news text, are also out-of-vocabulary words, which makes alignment harder still. To raise the probability of matching Uyghur and Chinese words, the authors automatically extract Chinese person and location names, translate them into Uyghur, build a bilingual name mapping table, and add it to the Uyghur-Chinese bilingual dictionary. For each dictionary word in a Uyghur sentence, its Chinese translations are then searched for in the Chinese sentence by string matching, which avoids Chinese word segmentation errors; accumulating all matched word pairs gives the lexical translation rate of the sentence pair. The final similarity of a sentence pair fuses this rate with number, punctuation, and length features. The similarities of all bilingual sentence pairs form a weight matrix, on which a greedy algorithm and a maximum-weight bipartite matching algorithm are used to find the most probable parallel sentence pairs. On 900 sentence pairs, the method reaches a best alignment accuracy of 95.67%.
Authors Ni Yaoqun (倪耀群), Xu Hongbo (许洪波), Cheng Xueqi (程学旗) (CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Department of Language Engineering, University of Chinese Academy of Sciences, Beijing 100049, China; Department of Language Engineering, University of Foreign Languages, Luoyang, Henan 471003, China)
Source Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University Core Journal, 2016, No. 4, pp. 124-133 (10 pages)
Funding National Natural Science Foundation of China (61232010, 61303156); National 973 Program (2012CB316303); National 863 Program (2012AA011003); National Key Technology R&D Program (2012BAH46B04)
Keywords sentence alignment; translation of person and location names; multi-feature fusion; maximum-weight matching in bipartite graphs
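
The pipeline summarized in the abstract (dictionary translations matched as substrings of the unsegmented Chinese sentence, feature fusion into a similarity matrix, then matching over that matrix) can be illustrated with a small Python sketch. This is not the authors' implementation. The function names, the 0.8/0.2 feature weights, the 0.3 threshold, and the toy lexicon entries are all assumptions made for illustration; the punctuation and length features are omitted for brevity, and only the greedy variant of the matching step is shown.

```python
import re

def translation_rate(uy_words, zh_sentence, lexicon):
    """Share of Uyghur words with a dictionary translation that occurs
    as a substring of the (unsegmented) Chinese sentence."""
    if not uy_words:
        return 0.0
    hits = sum(
        1 for w in uy_words
        if any(t in zh_sentence for t in lexicon.get(w, ()))
    )
    return hits / len(uy_words)

def number_overlap(uy_words, zh_sentence):
    """Jaccard overlap of the digit strings on both sides (1.0 if neither has any)."""
    a = set(re.findall(r"\d+", " ".join(uy_words)))
    b = set(re.findall(r"\d+", zh_sentence))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def similarity(uy_words, zh_sentence, lexicon, w_lex=0.8, w_num=0.2):
    """Weighted fusion of two features; the weights are hypothetical."""
    return (w_lex * translation_rate(uy_words, zh_sentence, lexicon)
            + w_num * number_overlap(uy_words, zh_sentence))

def greedy_align(sim, threshold=0.3):
    """Repeatedly take the highest-scoring unused (row, column) cell.
    Cells below the (hypothetical) threshold are never paired, so
    sentences without a counterpart stay unaligned."""
    cells = sorted(
        ((s, i, j) for i, row in enumerate(sim)
         for j, s in enumerate(row) if s >= threshold),
        reverse=True,
    )
    used_i, used_j, pairs = set(), set(), []
    for _, i, j in cells:
        if i not in used_i and j not in used_j:
            used_i.add(i)
            used_j.add(j)
            pairs.append((i, j))
    return sorted(pairs)

def align(uy_sentences, zh_sentences, lexicon):
    """Build the similarity matrix and return aligned (uy_index, zh_index) pairs."""
    sim = [[similarity(u, z, lexicon) for z in zh_sentences]
           for u in uy_sentences]
    return greedy_align(sim)

# Toy data (illustrative tokens only, not entries from a real dictionary).
lexicon = {"beyjing": ["北京"], "ürümchi": ["乌鲁木齐"], "xewer": ["新闻"]}
uy = [["beyjing", "2016"], ["ürümchi", "xewer"]]
zh = ["乌鲁木齐新闻", "北京2016年"]
print(align(uy, zh, lexicon))
```

In the toy data the two Chinese sentences appear in the opposite order from their Uyghur counterparts, which is the kind of misordering the abstract describes. The greedy step can be replaced by an exact maximum-weight bipartite matching; one option, if SciPy is available, is `scipy.optimize.linear_sum_assignment` with `maximize=True` applied to the same matrix.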