Abstract
To support the extraction of Chinese-Uyghur bilingual scientific and technical terms, this paper proposes a design for a Chinese-Uyghur comparable corpus in the news and science-and-technology domain and tests its feasibility experimentally. It proposes, for the first time, using a relatively mature Chinese-Uyghur machine translation system to build such a corpus. Chinese and Uyghur texts collected from the web are first preprocessed with the machine translation system and then mapped into a vector space, where the LSI algorithm is used to compute the correlations among the vectors. The resulting vectors are indexed, and the similarity between each source text and its candidate texts is computed in turn. Two experimental schemes are designed and compared, and the selected comparable texts are evaluated and filtered to build the Chinese-Uyghur comparable corpus.
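The similarity step described in the abstract can be illustrated with a minimal sketch. It assumes the texts have already been machine-translated into one language and tokenized; the toy English tokens, the function names (`rank_candidates`, `fit_lsi`, and so on), the latent dimension `k=2`, and the use of raw term counts are all illustrative assumptions, not details taken from the paper. The truncated SVD behind LSI is obtained here by power iteration on the document Gram matrix so that the sketch needs only the standard library; in practice a library SVD routine would normally be used instead.

```python
import math
import random
from collections import Counter

def term_doc_matrix(docs):
    """Return (vocab, A), where A[t][d] is the count of term t in document d."""
    vocab = sorted({w for doc in docs for w in doc})
    counts = [Counter(doc) for doc in docs]
    return vocab, [[c[w] for c in counts] for w in vocab]

def top_eigenpairs(G, k, iters=500, seed=0):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(G)
    G = [row[:] for row in G]  # work on a copy, since deflation modifies it
    rng = random.Random(seed)
    pairs = []
    for _round in range(k):
        v = [rng.random() for _ in range(n)]
        for _step in range(iters):
            w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(G[i][j] * v[j] for j in range(n)) for i in range(n))
        if lam > 1e-9:  # skip numerically zero directions
            pairs.append((lam, v))
            for i in range(n):
                for j in range(n):
                    G[i][j] -= lam * v[i] * v[j]
    return pairs

def fit_lsi(A, k):
    """Return up to k left singular vectors of A (the latent term basis)."""
    n_terms, n_docs = len(A), len(A[0])
    gram = [[sum(A[t][i] * A[t][j] for t in range(n_terms))
             for j in range(n_docs)] for i in range(n_docs)]
    basis = []
    for lam, v in top_eigenpairs(gram, k):
        sigma = math.sqrt(lam)  # singular value of A
        basis.append([sum(A[t][d] * v[d] for d in range(n_docs)) / sigma
                      for t in range(n_terms)])
    return basis

def project(basis, vec):
    """Map a term-count vector into the latent space."""
    return [sum(u[t] * vec[t] for t in range(len(vec))) for u in basis]

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

def rank_candidates(source_tokens, candidate_docs, k=2):
    """Rank candidate documents by latent-space cosine similarity to the source."""
    vocab, A = term_doc_matrix(candidate_docs)
    basis = fit_lsi(A, k)
    index = [project(basis, [row[d] for row in A])
             for d in range(len(candidate_docs))]
    counts = Counter(source_tokens)  # out-of-vocabulary source tokens are ignored
    query = project(basis, [counts[w] for w in vocab])
    sims = [cosine(query, vec) for vec in index]
    return sorted(enumerate(sims), key=lambda pair: -pair[1])

# Toy stand-ins for a translated source text and its candidate texts.
candidates = [
    ["satellite", "launch", "orbit", "rocket"],
    ["rocket", "launch", "engine", "test"],
    ["football", "match", "goal", "team"],
    ["team", "coach", "goal", "season"],
]
source = ["rocket", "orbit", "satellite"]
ranked = rank_candidates(source, candidates, k=2)
selected = [i for i, s in ranked if s >= 0.5]  # 0.5 is an arbitrary cutoff
```

In this toy run the two space-related candidates should score far above the two sports-related ones, including the candidate that shares only one word with the source. A screening threshold such as the 0.5 used for `selected` would have to be tuned on real data.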
Source
Journal of Xinjiang University (Natural Science Edition)
CAS
PKU Core (Peking University Chinese Core Journals)
2017, No. 3, pp. 316-321 (6 pages)
Funding
National Natural Science Foundation of China (61463048, 61462083, 61331011)
National Basic Research Program of China (973 Program) (2014CB340506)
Keywords
comparable corpora
Chinese-Uyghur comparable corpus construction
bilingual term extraction
LSI