Abstract
This paper proposes an error-suppressing, multi-strategy algorithm for aligning sentences in Chinese-Uyghur parallel corpora. Because length-based alignment algorithms cannot avoid error propagation, a new suppression strategy is presented: lexical co-occurrence information in the bilingual corpus is used to extract Chinese-Uyghur word collocations automatically, and these are combined with sentence-length features to identify sentence pairs with a 1:1 pattern that serve as anchor points, so that any alignment error is confined to the span between anchors. Between anchor points, sentences are aligned with a hybrid method based on punctuation and length. Experimental results confirm that the anchor points are identified with high precision and that the spread of alignment errors is effectively suppressed. The hybrid alignment algorithm avoids the high time complexity of word-based alignment algorithms and performs considerably better than traditional alignment algorithms: alignment accuracy rises from 95.0% to 97.6% and recall from 96.8% to 98.2%. The alignment-correctness evaluation method can effectively detect noisy alignments in the automatic output.
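The anchor idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `lexicon` dictionary (standing in for the automatically extracted collocations), the thresholds, the pattern penalties, and the length cost are all assumptions chosen for illustration, and the punctuation feature is omitted. It shows only how 1:1 anchors split the length-based dynamic program into independent segments, so an error in one segment cannot spread past the next anchor.

```python
from typing import Dict, List, Set, Tuple

Bead = Tuple[List[int], List[int]]  # (source sentence indices, target sentence indices)

# Hypothetical alignment patterns and penalties (assumed values, not from the paper).
PENALTY = {(1, 1): 0.0, (2, 1): 0.2, (1, 2): 0.2, (1, 0): 1.0, (0, 1): 1.0}

def length_cost(a: int, b: int) -> float:
    """Relative length mismatch in [0, 1]; 0 when both sides have equal length."""
    return abs(a - b) / max(a, b, 1)

def find_anchors(src: List[List[str]], tgt: List[List[str]],
                 lexicon: Dict[str, Set[str]],
                 min_hits: int = 2, max_cost: float = 0.3) -> List[Tuple[int, int]]:
    """Greedy, monotone search for 1:1 anchors using lexicon hits plus length."""
    anchors, next_j = [], 0
    for i, s in enumerate(src):
        s_len = sum(len(w) for w in s)
        for j in range(next_j, len(tgt)):
            t_words = set(tgt[j])
            hits = sum(1 for w in s if lexicon.get(w, set()) & t_words)
            t_len = sum(len(w) for w in tgt[j])
            if hits >= min_hits and length_cost(s_len, t_len) <= max_cost:
                anchors.append((i, j))
                next_j = j + 1  # keep anchors strictly increasing on both sides
                break
    return anchors

def align_segment(src_lens: List[int], tgt_lens: List[int],
                  s0: int, t0: int) -> List[Bead]:
    """Length-based dynamic program over one segment; s0/t0 are index offsets."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (a, b), pen in PENALTY.items():
                if i + a > n or j + b > m:
                    continue
                c = cost[i][j] + pen + length_cost(sum(src_lens[i:i + a]),
                                                   sum(tgt_lens[j:j + b]))
                if c < cost[i + a][j + b]:
                    cost[i + a][j + b] = c
                    back[i + a][j + b] = (a, b)
    beads, i, j = [], n, m
    while i or j:  # walk the back-pointers from the end of the segment
        a, b = back[i][j]
        beads.append((list(range(s0 + i - a, s0 + i)), list(range(t0 + j - b, t0 + j))))
        i, j = i - a, j - b
    return beads[::-1]

def align(src_lens: List[int], tgt_lens: List[int],
          anchors: List[Tuple[int, int]]) -> List[Bead]:
    """Align each inter-anchor segment independently; anchors are fixed 1:1 beads."""
    beads, ps, pt = [], 0, 0
    for ai, aj in anchors + [(len(src_lens), len(tgt_lens))]:  # sentinel closes the last segment
        beads += align_segment(src_lens[ps:ai], tgt_lens[pt:aj], ps, pt)
        if ai < len(src_lens):  # a real anchor, not the sentinel
            beads.append(([ai], [aj]))
        ps, pt = ai + 1, aj + 1
    return beads
```

For example, with source lengths `[10, 20, 5, 30]`, target lengths `[10, 12, 8, 5, 30]`, and a single anchor `(2, 3)`, `align` treats the sentences before and after the anchor as two separate problems, so a wrong choice in the first segment cannot shift the pairing of the last sentences.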
Source
Computer Science (《计算机科学》)
CSCD
PKU Core Journal (北大核心)
2010, No. 4, pp. 215-218, 292 (5 pages)
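Funding
Supported by the National Natural Science Foundation of China (60663006, 60963017) and the Scientific Research Program of Higher Education Institutions of the Xinjiang Uygur Autonomous Region (XJEDU2009I05).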
Keywords
Bilingual corpora
Error suppression
Sentence alignment
Hybrid strategy
Chinese-Uyghur sentences