Abstract
To achieve multilingual word alignment, this paper proposes translation probability, derived from pointwise mutual information (PMI), as an improved measure of association strength between words in different languages. First, it shows that in the ordinary-frequency word region that obeys Zipf’s law, the PMI measure of association strength between words can be simplified to translation probability. Second, after preprocessing Chinese, English, and Korean parallel corpora (sentence alignment, word segmentation, and stop-word removal), it computes the translation probability between words in the parallel corpora, takes the top k words with the highest translation probability as candidate translations, and applies further optimization to improve word alignment accuracy. Experimental results show that the method does not depend entirely on corpus size and achieves an accuracy above 94% on small-scale corpora, providing a technical basis for word alignment in niche cross-language literature and low-resource languages.
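The candidate-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact estimator, so the code assumes that the translation probability of a target word t given a source word s is the number of aligned sentence pairs containing both, divided by the number of pairs containing s. The function names `translation_probabilities` and `top_k_candidates` and the toy corpus are hypothetical, and the later optimization step mentioned in the abstract is not modeled.

```python
from collections import Counter, defaultdict


def translation_probabilities(pairs):
    """Estimate P(t | s) from sentence-aligned, tokenized, stop-word-filtered pairs.

    Assumption (not taken from the paper): P(t | s) is the number of aligned
    sentence pairs containing both s and t, divided by the number of pairs
    containing s.
    """
    src_count = Counter()            # number of pairs in which s occurs
    co_count = defaultdict(Counter)  # number of pairs in which s and t co-occur
    for src_tokens, tgt_tokens in pairs:
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        for s in src_set:
            src_count[s] += 1
            for t in tgt_set:
                co_count[s][t] += 1
    return {
        s: {t: c / src_count[s] for t, c in targets.items()}
        for s, targets in co_count.items()
    }


def top_k_candidates(probs, word, k=3):
    """Return the k target words with the highest estimated P(t | word).

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(probs.get(word, {}).items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical toy Chinese-English corpus, already preprocessed.
toy_corpus = [
    (["猫", "喝", "牛奶"], ["cat", "drinks", "milk"]),
    (["猫", "睡觉"], ["cat", "sleeps"]),
    (["狗", "喝", "水"], ["dog", "drinks", "water"]),
]
probs = translation_probabilities(toy_corpus)
candidates = top_k_candidates(probs, "猫", k=2)
print(candidates)
```

On this toy corpus the highest-ranked candidate for 猫 is "cat", because it is the only target word that co-occurs with 猫 in every sentence pair where 猫 appears. With real data, the choice of k and any post-filtering would need to follow the paper itself.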
Authors
YANG Feiyang (杨飞扬)
ZHAO Yahui (赵亚慧)
CUI Rongyi (崔荣一)
YI Zhiwei (易志伟)
YANG Feiyang; ZHAO Yahui; CUI Rongyi; YI Zhiwei (Intelligent Information Processing Laboratory, Department of Computer Science & Technology, Yanbian University, Yanji, Jilin 133002, China)
Source
《中文信息学报》
CSCD
Peking University Core Journals (北大核心)
2019, No. 12, pp. 37-44 (8 pages)
Journal of Chinese Information Processing
Funding
State Language Commission of China, 13th Five-Year Plan Research Project (YB135-76)
Yanbian University Research Project for World-Class Discipline Construction in Foreign Languages and Literatures (18YLPY13, 18YLPY14)
Keywords
word alignment
parallel corpus
translation probability
Zipf’s law