Abstract
Automatic word segmentation of Dai text (Daiwen) is a foundational task in Dai information processing. It underpins downstream work such as Dai input methods, Dai machine translation systems, and Dai text information extraction. Because Dai corpus resources and technology are limited, natural language processing for Dai remains relatively weak. This paper first analyzes the characteristics of Dai text and builds a Dai corpus on that basis. It then applies Chinese word segmentation methods to Dai and, taking the characteristics of Dai into account, designs a Dai word segmentation system based on syllable sequence labeling. In experiments, the system achieves an overall evaluation score of 95.58%.
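As an illustration of what syllable sequence labeling means for segmentation, the sketch below converts a segmented sentence into one tag per syllable and recovers the words from the tags. The four-tag B/M/E/S scheme (begin, middle, end, single) is an assumption borrowed from common practice in Chinese word segmentation; the abstract does not state which tag set the paper uses. The function names and the placeholder syllables `s1` to `s6` are hypothetical. In a full system, a sequence model such as a CRF would predict the tags; only the encoding and decoding steps are shown here.

```python
from typing import List


def words_to_tags(words: List[List[str]]) -> List[str]:
    """Map each word (a list of syllables) to B/M/E/S tags, one per syllable.

    Assumed tag set (not confirmed by the source):
    B = first syllable of a multi-syllable word, M = middle syllable,
    E = last syllable, S = single-syllable word.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # ignore empty words rather than emit tags for them
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(syllables: List[str], tags: List[str]) -> List[List[str]]:
    """Rebuild words from a syllable sequence and its predicted tags.

    A word is closed whenever an E or S tag is seen; any trailing
    syllables left open by an inconsistent tag sequence form a final word.
    """
    words: List[List[str]] = []
    current: List[str] = []
    for syllable, tag in zip(syllables, tags):
        current.append(syllable)
        if tag in ("E", "S"):
            words.append(current)
            current = []
    if current:
        words.append(current)
    return words


# Placeholder syllables stand in for real Dai syllables.
segmented = [["s1", "s2"], ["s3"], ["s4", "s5", "s6"]]
syllables = [s for word in segmented for s in word]
tags = words_to_tags(segmented)
restored = tags_to_words(syllables, tags)
```

Framing segmentation this way reduces it to a per-syllable classification problem, which is what lets methods developed for Chinese be reused on a different script.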
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journals (北大核心)
2013, Issue 6, pp. 187-191 (5 pages)
Funding
National Natural Science Foundation of China (grants 61273288, 61233009, 61203258, 61305003, 61332017, 61375027)
China-Singapore Institute of Digital Media (CSIDM) fund
Keywords
Daiwen (傣文)
word segmentation
CRF
absolute segmentation word