Abstract
Statistical sentence alignment selects, from all candidate alignments, the one with the maximum probability given the lengths of the bilingual sentences. Chinese-English legal documents are translated literally, which makes them well suited to statistical alignment. However, the parameter computation used for Indo-European languages cannot be applied to Chinese-English corpora. Two parameter computation methods for aligning Chinese-English corpora are proposed, both of which make the evaluation function of the alignment model follow the standard normal distribution. In the first method, the parameter s² is the slope of the line obtained by linear regression over all points (l₁, (l₂ − c·l₁)²) in the training corpus; in the second, s² is obtained by computing the variance directly. Experimental results show that sub-sentence-level alignment of Chinese-English legal documents achieves a precision of 98.8% and a recall of 99.2%.
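The two ways of estimating s² described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that l₁ and l₂ are the lengths of an aligned source and target segment, that c is the expected target-to-source length ratio, and that the evaluation function has the Gale-Church-style form δ = (l₂ − c·l₁) / sqrt(l₁·s²); none of these definitions is spelled out in the abstract. The function names are hypothetical, and `s2_by_variance` is only one plausible reading of "computing the variance directly".

```python
import math


def estimate_c(pairs):
    """Assumed estimate of c: total target length over total source length."""
    total_l1 = sum(l1 for l1, _ in pairs)
    total_l2 = sum(l2 for _, l2 in pairs)
    return total_l2 / total_l1


def s2_by_regression(pairs, c):
    """Slope of the ordinary least-squares line through (l1, (l2 - c*l1)^2)."""
    xs = [l1 for l1, _ in pairs]
    ys = [(l2 - c * l1) ** 2 for l1, l2 in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def s2_by_variance(pairs, c):
    """Assumed direct estimate: mean of (l2 - c*l1)^2 / l1 over all pairs.

    This follows from assuming Var(l2 - c*l1) = s2 * l1.
    """
    return sum((l2 - c * l1) ** 2 / l1 for l1, l2 in pairs) / len(pairs)


def delta(l1, l2, c, s2):
    """Assumed evaluation function, approximately N(0, 1) for true alignments."""
    return (l2 - c * l1) / math.sqrt(l1 * s2)
```

Given a list of `(l1, l2)` length pairs from hand-aligned training data, `estimate_c` would be called first, then either estimator for s², and `delta` would score a candidate alignment of two segments.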
Source
Journal of Northeastern University (Natural Science) (《东北大学学报(自然科学版)》)
EI
CAS
CSCD
PKU Core Journals (北大核心)
2003, No. 1, pp. 23-26 (4 pages)
Funding
National Natural Science Foundation of China (60083006)
National Key Basic Research Development Program of China (G19980305011)
Keywords
bilingual corpora
Chinese-English law literature
sub-sentence alignment
statistical approach
evaluation function
parameter computation
standard normal distribution
Chinese
English
machine translation