Abstract
[Objective] To classify acquired bilingual corpora and perform sentence alignment on the classified corpora in order to generate domain-specific parallel corpora. [Methods] An SVM-based text classifier is used to classify the acquired Chinese-English bilingual corpora. The classified corpora are then aligned with a sentence alignment tool that combines length-based and lexicon-based methods. To improve alignment accuracy, the Chinese-English sentence length parameters are computed from manually aligned Chinese-English parallel corpora and used together with a Chinese-English bilingual dictionary, yielding high-quality parallel corpora for specialized domains. [Results] After sentence alignment of the corpus for each domain, the method achieves a sentence alignment accuracy of 95.45%. The computed mean sentence length ratio is 1.7777, with a variance of 1.2640. [Limitations] Because the bilingual corpora were already well aligned initially, the reported alignment accuracy may not be generally representative. [Conclusions] The experimental results indicate that the method is effective and can produce domain parallel corpora of satisfactory quality.
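The length-based part of the alignment step can be illustrated with a minimal sketch in the style of the Gale-Church algorithm. This is not the authors' tool. It assumes that the reported mean (1.7777) and variance (1.2640) play the role of the Gale-Church parameters c and s^2, that the ratio is target length over source length, and that lengths are measured in whatever unit the paper used. The prior probabilities, the function names `length_cost` and `align`, and the sample lengths are all hypothetical. The dictionary-based (lexical) score described in the abstract is omitted here.

```python
import math

# Values reported in the abstract. Treating them as the Gale-Church
# parameters c (mean target/source length ratio) and s^2 is an assumption.
MEAN_RATIO = 1.7777
VARIANCE = 1.2640

# Hypothetical priors for each bead type (source sentences, target sentences).
# They are illustrative only and do not come from the paper.
PRIORS = {(1, 1): 0.89, (1, 0): 0.005, (0, 1): 0.005, (2, 1): 0.045, (1, 2): 0.045}


def length_cost(src_len, tgt_len, prior):
    """Negative log-probability that two spans are mutual translations."""
    mean = (src_len + tgt_len / MEAN_RATIO) / 2.0
    delta = (tgt_len - src_len * MEAN_RATIO) / math.sqrt(mean * VARIANCE)
    # Two-tailed probability of a deviation at least this large under N(0, 1).
    p = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(p, 1e-300)) - math.log(prior)


def align(src_lens, tgt_lens):
    """Return the minimum-cost list of beads as (source indices, target indices)."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward dynamic programming over prefixes of both sentence lists.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = cost[i][j] + length_cost(
                    sum(src_lens[i:ni]), sum(tgt_lens[j:nj]), prior
                )
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Follow back-pointers from the end to recover the chosen beads.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


if __name__ == "__main__":
    # Hypothetical sentence lengths for a three-sentence source passage
    # and a two-sentence target passage.
    print(align([10, 10, 15], [36, 27]))
```

In this sketch, a pair of spans is cheap when the target length is close to 1.7777 times the source length, so the dynamic program prefers one-to-one beads and falls back to merges or skips only when the lengths disagree strongly. A combined aligner of the kind the abstract describes would add a dictionary-match term to the same cost.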
Source
《现代图书情报技术》
CSSCI
PKU Core Journals (北大核心)
2014, No. 12, pp. 36-43 (8 pages)
New Technology of Library and Information Service
Keywords
Sentence alignment
Text classification
Parallel corpora
Machine translation