Abstract
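Because terms in historical classics are widely polysemous and no word segmentation algorithm exists for ancient Chinese, automatically acquiring term translation pairs by alignment over a bilingual parallel corpus is very difficult. To address these problems, this paper proposes a sub-word-based maximum entropy model for aligning terms in the classics. The method combines two kinds of statistical information to extract characters that frequently occur together as sub-words, and uses these sub-words to segment the classics, which addresses the lack of a segmentation algorithm for ancient Chinese. To handle the polysemy of the terms, a transliteration feature function is defined according to the transliteration patterns of the terms and combined with other features in a maximum entropy model to determine each term's translation. Experiments on a bilingual parallel corpus of Shi Ji (《史记》) show that the method using sub-words far outperforms the method without sub-words, and that the maximum entropy model combining three kinds of features effectively improves term alignment accuracy.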
It is difficult to extract term translation pairs from a parallel corpus of historical classics because no proper word segmentation method exists for ancient Chinese. In this paper we introduce a term alignment method that uses a maximum entropy model based on sub-words. In our approach, we first extract frequently co-occurring character pairs as sub-words using chi-square statistics and the log-likelihood ratio test, and apply them to segment the Chinese text. We then build transliteration features according to the transliteration patterns of terms in the classics, and perform term alignment with the maximum entropy model. The use of sub-words addresses the lack of a word segmentation method for ancient Chinese, and the maximum entropy model integrating three kinds of features deals with the polysemy of terms. Experiments on the parallel corpus of Shi Ji show the effectiveness of sub-words, with a large improvement in performance over IBM Model 4.
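The sub-word extraction step named in the abstract can be illustrated with a minimal Python sketch. It scores adjacent character pairs with the chi-square statistic and the log-likelihood ratio over a 2x2 contingency table, keeps pairs that pass both tests, and segments text greedily with the resulting pairs. The function names, the rule of requiring both scores to pass, the threshold of 3.84 (the 95% critical value for one degree of freedom), and the greedy left-to-right segmentation are all assumptions made for illustration; the abstract does not state how the paper combines the two statistics or segments the text.

```python
import math
from collections import Counter


def chi_square(o11, o12, o21, o22):
    """Pearson chi-square for a 2x2 contingency table of observed counts."""
    n = o11 + o12 + o21 + o22
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0


def llr(o11, o12, o21, o22):
    """Log-likelihood ratio G^2 = 2 * sum(O * ln(O / E)) for a 2x2 table."""
    n = o11 + o12 + o21 + o22
    cells = (
        (o11, o11 + o12, o11 + o21),
        (o12, o11 + o12, o12 + o22),
        (o21, o21 + o22, o11 + o21),
        (o22, o21 + o22, o12 + o22),
    )
    total = 0.0
    for obs, row, col in cells:
        if obs > 0:  # a zero cell contributes nothing to the sum
            total += obs * math.log(obs * n / (row * col))
    return 2.0 * total


def bigram_scores(sentences):
    """Map each adjacent character pair to (chi_square, llr)."""
    pairs, firsts, seconds = Counter(), Counter(), Counter()
    for s in sentences:
        for a, b in zip(s, s[1:]):
            pairs[(a, b)] += 1
            firsts[a] += 1
            seconds[b] += 1
    n = sum(pairs.values())
    scores = {}
    for (a, b), o11 in pairs.items():
        o12 = firsts[a] - o11   # a followed by a character other than b
        o21 = seconds[b] - o11  # b preceded by a character other than a
        o22 = n - o11 - o12 - o21
        scores[(a, b)] = (chi_square(o11, o12, o21, o22),
                          llr(o11, o12, o21, o22))
    return scores


def select_subwords(scores, chi_min=3.84, llr_min=3.84):
    """Keep pairs passing both tests (assumed combination rule)."""
    return {p for p, (c, l) in scores.items() if c >= chi_min and l >= llr_min}


def segment(sentence, subwords):
    """Greedy left-to-right segmentation: merge a pair if it is a sub-word."""
    tokens, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in subwords:
            tokens.append(sentence[i:i + 2])
            i += 2
        else:
            tokens.append(sentence[i])
            i += 1
    return tokens
```

A typical use would be `segment(text, select_subwords(bigram_scores(corpus)))`, where `corpus` is a list of unsegmented Chinese sentences. On a realistic corpus the thresholds would need tuning, since very rare pairs can receive inflated chi-square scores.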
Source
《中文信息学报》
CSCD
PKU Core Journals (北大核心)
2016, No. 3, pp. 46-51 (6 pages)
Journal of Chinese Information Processing
Funding
National Natural Science Foundation of China (61402068, 61304206)
Keywords
sub-words (子词)
term alignment (术语对齐)
maximum entropy model (最大熵模型)
transliteration features (音译特征)