
Sub-Word Based Translation Extraction for Terms in Chinese Historical Classics
Original title: 基于子词的历史典籍术语对齐方法 (Sub-Word Based Term Alignment Method for Historical Classics)
Cited by: 1
Abstract: Terms in historical classics are widely polysemous, and ancient Chinese lacks a word segmentation algorithm, so automatically extracting term translation pairs from a bilingual parallel corpus with alignment methods is difficult. To address both problems, this paper proposes a sub-word based maximum entropy model for aligning terms in historical classics. The method combines two statistics, the chi-square statistic and the log-likelihood ratio test, to extract characters that frequently occur together as sub-words, and then segments the classics with these sub-words, which addresses the lack of a segmentation algorithm for ancient Chinese. To handle the polysemy of terms, transliteration feature functions are defined according to the transliteration patterns of terms in the classics and combined with other features in a maximum entropy model that determines each term's translation. Experiments on a bilingual parallel corpus of Shi Ji (《史记》) show that the method using sub-words far outperforms the method without them, with a large improvement in performance over IBM Model 4, and that the maximum entropy model integrating three kinds of features effectively improves term alignment accuracy.
Authors: Che Chao (车超), Zheng Xiaojun (郑晓军)
Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list (北大核心), 2016, No. 3, pp. 46-51 (6 pages)
Funding: National Natural Science Foundation of China (61402068, 61304206)
Keywords: sub-words; term alignment; maximum entropy model; transliteration features
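The sub-word extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names the two statistics but not how they are combined, so requiring both to pass, the 3.84 cutoffs (the 5% critical value of a chi-square distribution with one degree of freedom), the `min_count` filter, the greedy left-to-right segmentation, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def score_pairs(text):
    """Return {(a, b): (count, chi2, g2)} for positively associated adjacent characters."""
    chars = [c for c in text if not c.isspace()]
    pairs = list(zip(chars, chars[1:]))
    n = len(pairs)
    pair_count = Counter(pairs)
    first_count = Counter(a for a, _ in pairs)
    second_count = Counter(b for _, b in pairs)
    scores = {}
    for (a, b), o11 in pair_count.items():
        # 2x2 contingency table over all adjacent pairs in the text.
        o12 = first_count[a] - o11   # a followed by something other than b
        o21 = second_count[b] - o11  # b preceded by something other than a
        o22 = n - o11 - o12 - o21    # neither a first nor b second
        row1, row2 = o11 + o12, o21 + o22
        col1, col2 = o11 + o21, o12 + o22
        if o11 * n <= row1 * col1:
            continue  # observed count does not exceed the expected count
        denom = row1 * row2 * col1 * col2
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
        # Log-likelihood ratio: G^2 = 2 * sum(observed * ln(observed / expected)).
        g2 = 0.0
        cells = ((o11, row1, col1), (o12, row1, col2),
                 (o21, row2, col1), (o22, row2, col2))
        for obs, row, col in cells:
            if obs > 0:
                g2 += 2.0 * obs * math.log(obs * n / (row * col))
        scores[(a, b)] = (o11, chi2, g2)
    return scores


def extract_subwords(text, min_count=2, chi2_min=3.84, g2_min=3.84):
    """Keep pairs that are frequent enough and pass both tests (assumed rule)."""
    return {
        a + b
        for (a, b), (count, chi2, g2) in score_pairs(text).items()
        if count >= min_count and chi2 >= chi2_min and g2 >= g2_min
    }


def segment(text, subwords):
    """Greedy left-to-right segmentation into sub-words and single characters."""
    tokens, i = [], 0
    while i < len(text):
        if text[i:i + 2] in subwords:
            tokens.append(text[i:i + 2])
            i += 2
        else:
            tokens.append(text[i])
            i += 1
    return tokens
```

On the toy string `"项羽者下相人也项羽大怒项羽乃悉引兵渡河"`, `extract_subwords` returns `{"项羽"}`, and `segment("项羽大怒", {"项羽"})` returns `["项羽", "大", "怒"]`. The `min_count` filter matters on small samples: a pair seen only once can still score high on both statistics, so a frequency floor keeps such accidental pairs out of the sub-word list.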
