Abstract
A segmentation dictionary is a key foundation of any automatic Tibetan word segmentation system: the size of the dictionary and the quality of the algorithm design directly affect segmentation efficiency. This project first collected all entries from several Tibetan character and word dictionaries, together with Tibetan punctuation marks, to form a large Tibetan segmentation lexicon of about 100,000 entries. Then, based on the varying lengths of Tibetan characters (syllables), a multi-level index dictionary mechanism specific to Tibetan was built, and a Tibetan whole-word dichotomy (binary search over whole words) method was analyzed and designed for segmentation. Experimental results show that the dictionary has a simple structure, fast segmentation speed, and high lookup performance.
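The abstract does not give the dictionary layout or the matching strategy, so the following is only a minimal sketch of how a length-indexed dictionary with whole-word binary search might look. Everything in it is an assumption made for illustration: the first-level index keyed by syllable count, the sorted buckets searched with `bisect`, the forward maximum matching loop, and the names `build_index`, `contains`, and `segment`. Punctuation handling is left out.

```python
from bisect import bisect_left

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (tsheg), used here to split syllables


def build_index(words):
    """Group entries by syllable count (assumed first-level index).

    Each bucket is a sorted list of syllable tuples, so a whole word
    can be located with a single binary search inside its bucket.
    """
    buckets = {}
    for word in words:
        key = tuple(s for s in word.split(TSHEG) if s)
        if key:
            buckets.setdefault(len(key), set()).add(key)
    return {n: sorted(bucket) for n, bucket in buckets.items()}


def contains(index, syllables):
    """Binary-search the bucket whose length matches the candidate word."""
    bucket = index.get(len(syllables), [])
    pos = bisect_left(bucket, syllables)
    return pos < len(bucket) and bucket[pos] == syllables


def segment(index, text):
    """Forward maximum matching (an assumed strategy, not taken from the paper).

    At each position, try the longest candidate first; an unmatched
    single syllable is emitted as its own token.
    """
    syllables = tuple(s for s in text.split(TSHEG) if s)
    max_len = max(index, default=1)
    tokens, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = syllables[i:i + n]
            if n == 1 or contains(index, candidate):
                tokens.append(TSHEG.join(candidate))
                i += n
                break
    return tokens
```

Keying the first level by length means each lookup compares a candidate only against entries of the same length, which keeps every binary search small even when the full lexicon is large.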
Source
Journal of Computer Applications (《计算机应用》)
CSCD
PKU Core Journal (北大核心)
2009, Issue B06, pp. 178-180 (3 pages)
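Funding
Open Project of the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
National 863 Program of China (AA2006010101)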
Keywords
Tibetan word segmentation
word segmentation dictionary
Tibetan whole-word dichotomy
multi-level index