An Approach to Automatic Tibetan Sentence Segmentation Based on Rules and Statistics (cited by: 7)
Abstract: Sentence segmentation is one of the difficult problems in Tibetan information processing and a basic prerequisite for tasks such as Tibetan-Chinese machine translation and Tibetan text classification. To resolve the functional ambiguity of Tibetan punctuation marks, this paper proposes an automatic sentence segmentation method that combines rules with statistics. In the rule stage, an experience-based (heuristic) method first identifies uncertain Tibetan sentences as candidate sentences; a compound-sentence analysis based on connective words then merges clauses to form secondary candidate sentences. Finally, the maximum entropy method segments the secondary candidate sentences. The heuristic method and the compound-sentence analysis address the corpus sparsity and clause problems that the maximum entropy algorithm cannot handle on its own. Experiments show that the method performs well, with an F1 score above 98%.
Affiliation: Northwest Minzu University (西北民族大学)
Source: Journal of Yunnan University (Natural Sciences Edition), indexed in CAS, CSCD, and the Peking University Core Journals list; 2012, No. 6, pp. 653-657, 663 (6 pages)
Funding: National Natural Science Foundation of China (61032008, 60970071); Natural Science Foundation of Gansu Province (1107RJZA157)
Keywords: automatic segmentation of Tibetan sentences; analysis of compound sentences; secondary candidate sentences; maximum entropy
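
The three stages described in the abstract can be sketched roughly as follows. This is only an illustrative skeleton, not the authors' implementation: the function names, the connective list, and the toy decision rule are all assumptions, and the `is_sentence_end` callable merely stands in for a trained maximum entropy classifier. The example text uses placeholder tokens rather than real Tibetan words; only the shad mark (U+0F0D) is genuine.

```python
import re

SHAD = "\u0f0d"  # TIBETAN MARK SHAD, a punctuation mark with several functions

# A candidate unit is a run of non-shad characters plus any shads that follow it.
CANDIDATE = re.compile(f"[^{SHAD}]+{SHAD}*")


def split_candidates(text):
    """Stage 1 (rules): cut at every shad to obtain candidate sentences."""
    units = (m.strip() for m in CANDIDATE.findall(text))
    return [u for u in units if u]


def merge_clauses(units, connectives):
    """Stage 2 (rules): join a clause ending in a connective to the next unit.

    `connectives` is a hypothetical, user-supplied set of connective words.
    """
    merged, pending = [], ""
    for unit in units:
        pending = (pending + " " + unit).strip()
        words = unit.rstrip(SHAD).split()
        if words and words[-1] in connectives:
            continue  # clause is linked to what follows; keep accumulating
        merged.append(pending)
        pending = ""
    if pending:
        merged.append(pending)
    return merged


def segment(text, connectives, is_sentence_end):
    """Stage 3 (statistics): let a classifier accept or reject each boundary.

    `is_sentence_end` is a placeholder for a trained maximum entropy model.
    """
    sentences, pending = [], ""
    for cand in merge_clauses(split_candidates(text), connectives):
        pending = (pending + " " + cand).strip()
        if is_sentence_end(cand):
            sentences.append(pending)
            pending = ""
    if pending:
        sentences.append(pending)
    return sentences


# Toy usage: "IF" is a made-up connective, and the lambda is a made-up rule
# (a unit with at least two tokens ends a sentence), not a real model.
text = f"w1 w2 IF{SHAD} w3 w4{SHAD} w5{SHAD} w6 w7{SHAD}"
toy_rule = lambda cand: len(cand.rstrip(SHAD).split()) >= 2
result = segment(text, {"IF"}, toy_rule)
print(result)
```

Passing the classifier in as a callable keeps the rule stages independent of the statistical stage, so a real model trained on annotated Tibetan text could replace the toy rule without changing the rest of the sketch.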