Abstract
This paper describes a method for extracting bilingual phrase pairs from a domain-specific Chinese-Tibetan parallel corpus of laws, regulations, and official documents. Widely used phrase extraction methods depend heavily on word alignment results or on additional resources such as part-of-speech tagging or syntactic analysis. Because such resources are currently scarce for Tibetan, the paper proposes a two-step Chinese-Tibetan phrase pair extraction method. The first step extracts valid Chinese phrases (multi-word chunks) using Nagao's algorithm and a substring reduction algorithm; this step is not the focus of the paper. The second step obtains the Tibetan translation of each Chinese phrase to be translated, for which the paper proposes a Tibetan word sequence intersection algorithm (TIA). TIA works well for both 1-1 and 1-n Tibetan phrase translations, whether continuous or discontinuous.
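The idea of intersecting word sequences can be illustrated with a minimal sketch. This is not the paper's TIA, whose details the abstract does not give. It assumes a simplified reading: collect the tokenized Tibetan sides of all sentence pairs whose Chinese side contains the phrase, then keep the words shared by every one of them. The function names `intersect_word_sequences` and `candidate_translation` and the placeholder tokens are hypothetical.

```python
def intersect_word_sequences(tibetan_sentences):
    """Return the words shared by every tokenized sentence.

    Words are returned in the order they first appear in the first sentence.
    Assumption: plain set intersection; the paper's TIA may differ.
    """
    if not tibetan_sentences:
        return []
    common = set(tibetan_sentences[0])
    for sentence in tibetan_sentences[1:]:
        common &= set(sentence)
    seen = set()
    ordered = []
    for word in tibetan_sentences[0]:
        if word in common and word not in seen:
            seen.add(word)
            ordered.append(word)
    return ordered


def candidate_translation(chinese_phrase, corpus):
    """Intersect the Tibetan sides of all pairs whose Chinese side contains the phrase.

    corpus: list of (chinese_sentence, tibetan_tokens) pairs (hypothetical layout).
    """
    matching = [tib for zh, tib in corpus if chinese_phrase in zh]
    return intersect_word_sequences(matching)


# Placeholder tokens stand in for real Chinese and Tibetan words.
corpus = [
    ("A B C", ["x", "p", "q", "y"]),  # "p q" is contiguous here
    ("D B C", ["p", "z", "q"]),       # "p ... q" is discontinuous here
    ("E F", ["x", "z"]),              # does not contain the phrase
]
print(candidate_translation("B C", corpus))
```

Because the sketch keeps shared words regardless of adjacency, it returns the same candidate whether the words appear contiguously or with other words in between, which is the continuous versus discontinuous distinction the abstract mentions. A realistic version would also need to filter out frequent function words that occur in every sentence.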
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journals
2011, Issue 2, pp. 105-110, 121 (7 pages)
Funding
Supported by the Chinese Academy of Sciences "Western Action Plan" High-Tech Project (KGCX2-YW-512)
Keywords
Chinese-Tibetan phrase extraction
Tibetan information processing
Chinese information processing