Abstract
This paper presents an algorithm for automatically constructing an English-Chinese translation lexicon from a sentence-aligned English-Chinese parallel corpus of spoken language. First, an electronic definition dictionary is used to filter the bilingual text, which yields the first part of the lexicon, the "filtered lexicon". Second, co-occurrence probabilities are counted and an association score is computed for every word pair, producing the Table of Chinese-English (English-Chinese) Words Co-occurrence Association Score. Then, for each source word, the target words with the highest association scores (the top five) are taken as candidate word pairs, and in each of the four tables a word pair ranked this way receives a confidence score of 1. The confidence scores accumulated by each candidate pair serve as the grading criterion, which gives translation lexicons at 4 levels. The filtered lexicon combined with the level-4 lexicon reaches a precision of 93.389% at a recall of 93.5%. This is an encouraging result, because the corpus pairs an Indo-European language with a non-Indo-European one. Grading the lexicon effectively reduces the number of incorrect entries in the high-level lexicons, which makes the translation lexicon more practical and addresses the trade-off between precision and recall.
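The scoring and grading steps can be illustrated with a minimal sketch. The abstract does not give the association formula or say how the four tables are built, so the code below makes two loudly labeled assumptions: it uses the Dice coefficient as a stand-in association score, and it votes over only two directional tables (source-to-target and target-to-source), so confidence ranges from 0 to 2 rather than over four levels. The function names `association_table`, `top_k_votes`, and `graded_lexicon` are hypothetical, and the dictionary-filtering step is omitted.

```python
from collections import Counter, defaultdict
from itertools import product


def association_table(pairs):
    """Build {src_word: {tgt_word: score}} from aligned sentence pairs.

    ASSUMPTION: the score is the Dice coefficient over sentence-level
    co-occurrence counts; the paper's actual formula is not in the abstract.
    """
    src_freq, tgt_freq, co_freq = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in pairs:
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_freq.update(src_set)
        tgt_freq.update(tgt_set)
        co_freq.update(product(src_set, tgt_set))
    table = defaultdict(dict)
    for (s, t), c in co_freq.items():
        table[s][t] = 2 * c / (src_freq[s] + tgt_freq[t])
    return table


def top_k_votes(table, k, flip=False):
    """Give confidence 1 to each pair ranked in the top k for its source word."""
    votes = Counter()
    for s, scores in table.items():
        # Sort by descending score, breaking ties alphabetically.
        best = sorted(scores, key=lambda t: (-scores[t], t))[:k]
        for t in best:
            votes[(t, s) if flip else (s, t)] += 1
    return votes


def graded_lexicon(pairs, k=5):
    """Sum votes from both directions; keys are (chinese, english) pairs.

    ASSUMPTION: two tables instead of the paper's four.
    """
    forward = association_table(pairs)
    backward = association_table([(t, s) for s, t in pairs])
    return top_k_votes(forward, k) + top_k_votes(backward, k, flip=True)


if __name__ == "__main__":
    corpus = [
        (["我", "喝", "茶"], ["i", "drink", "tea"]),
        (["我", "喝", "水"], ["i", "drink", "water"]),
        (["他", "喝", "茶"], ["he", "drink", "tea"]),
    ]
    for pair, confidence in sorted(graded_lexicon(corpus, k=1).items()):
        print(pair, confidence)
```

In this sketch a pair that ranks highly in both directions collects more votes than one that ranks highly in only one, which mirrors the idea of using accumulated confidence as a grading criterion: stricter levels keep fewer but more reliable entries.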
Source
《计算机学报》
EI
CSCD
Peking University Core Journals (北大核心)
2003, No. 3, pp. 275-280 (6 pages)
Chinese Journal of Computers