Abstract
In building corpora for machine translation, most length-based Chinese-Uyghur sentence alignment and length-similarity algorithms use the character as the unit for measuring sentence length in both languages, although other units are also worth trying. This paper compares four combinations of length units through statistical analysis and experiments, in order to determine the best unit for measuring Chinese and Uyghur sentence length and thereby improve the accuracy of sentence alignment. In the bilingual sentence pairs, the number of Chinese characters and the number of Uyghur words are highly correlated, and the ratio of the two sentence lengths is approximately normally distributed. The experimental results show that Chinese characters and Uyghur words are the best length units for Chinese-Uyghur sentence alignment, giving the highest precision and recall, 94% and 93.6% respectively.
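The statistics mentioned in the abstract (the correlation between Chinese character counts and Uyghur word counts, and the distribution of the length ratio) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace-based word split for Uyghur, and the choice to ignore whitespace in Chinese sentences are all assumptions made for illustration, and the abstract does not say how punctuation was counted.

```python
import math


def zh_char_len(sentence):
    """Chinese sentence length in characters (assumption: whitespace is ignored)."""
    return sum(1 for ch in sentence if not ch.isspace())


def ug_word_len(sentence):
    """Uyghur sentence length in words (assumption: words are whitespace-separated)."""
    return len(sentence.split())


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


def length_stats(pairs):
    """Return (correlation, mean ratio, ratio variance) for (zh, ug) sentence pairs.

    The ratio is Uyghur words per Chinese character. Its mean and variance are
    the parameters a length-based aligner would plug into a normal model.
    Every Chinese sentence must be non-empty.
    """
    zh_lens = [zh_char_len(zh) for zh, _ in pairs]
    ug_lens = [ug_word_len(ug) for _, ug in pairs]
    ratios = [u / z for z, u in zip(zh_lens, ug_lens)]
    mean = sum(ratios) / len(ratios)
    variance = sum((r - mean) ** 2 for r in ratios) / len(ratios)
    return pearson(zh_lens, ug_lens), mean, variance
```

With a real parallel corpus, `length_stats` would be called once per candidate unit combination (for example character/character versus character/word), and the combination with the highest correlation and the most nearly normal ratio would be preferred. A normality check is not included in this sketch.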
Authors
SAMAT Mamitimin (塞麦提·麦麦提敏)
TURGUN Ibrahim (吐尔根·伊布拉音)
Xinjiang University, Urumqi 830046
Source
Modern Computer (《现代计算机》)
2018, No. 22, pp. 8-11, 16 (5 pages)
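Funding
National Social Science Fund of China project (No. 17XYY034)
Ministry of Education Humanities and Social Sciences Youth Project (No. 16XJJC740001)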