Abstract
Function words play an important syntactic and semantic bridging role in the understanding of Tibetan text, and the effectiveness of the rules governing them plays a special role in Tibetan word segmentation. Because function words are numerous and serve many roles, Tibetan word segmentation can, in a certain sense, be regarded as a process of function word identification. The correctness of function word identification therefore directly affects the quality of Tibetan word segmentation. Based on the grammatical rules of Tibetan and the particular functions of its function words, this paper first constructs a function word knowledge base and a library of function words that belong to more than one category, which together serve as the basis for identifying function words in continuous Tibetan text. Second, it develops a segmentation word list annotated with lexical attributes and a training corpus of a certain scale, performs part-of-speech (POS) tagging with a method based on Conditional Random Fields (CRF), and combines the function word and POS tagging resources to build an integrated model for automatic Tibetan word segmentation and POS tagging.
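Funding
2011 National Natural Science Foundation of China project "Construction of a Tibetan Dependency Treebank" (Grant No. 61163043)
National Natural Science Foundation of China project "Formal Study of Basic Tibetan Sentence Patterns Based on Function Words" (Grant No. 61063015)
Ministry of Education Humanities and Social Sciences Fund Youth Project "Research on Automatic Proofreading Methods for Modern Tibetan Syllables" (Grant No. 10YJCZH033)
State Language Commission project "Construction of a Large-Scale Basic Tibetan Corpus" (Grant No. MZ115-039)
Interim result of the 2011 Tibet Autonomous Region General Science and Technology Program project "Corpus-Based Quantitative Study of Tibetan Vocabulary"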
Keywords
Tibetan language
Word segmentation
POS tagging