Abstract
A Tibetan corpus is the raw material from which computers study the regularities of the Tibetan language, and building one is the foundation and prerequisite for research on Tibetan information processing. In corpus construction, the type number of a sample is the basis for identifying the sample's category. It is also the link that connects the Tibetan corpus information database, the sample documents, and the users, which makes it a key element of the corpus. Drawing on the construction of a balanced Tibetan corpus, this paper designs a Tibetan corpus database, defines the categories of the Tibetan corpus, and designs and implements a method for generating sample type numbers.
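Funding
2011 National Natural Science Foundation of China project "Formal Study of Basic Tibetan Sentence Patterns Based on Function Words" (Grant No. 61063015)
2011 National Natural Science Foundation of China project "Construction of a Tibetan Dependency Treebank" (Grant No. 61163043)
2005 State Language Commission project "Construction of a Large-Scale Basic Tibetan Corpus" (Grant No. MZ115-039)
Interim result of the 2011 Tibet Autonomous Region Science and Technology Program project "Corpus-Based Quantitative Study of Tibetan Vocabulary"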
Keywords
Tibetan language
corpus
samples
type number