Abstract
Ambiguity resolution is one of the key techniques in automatic Chinese word segmentation. A general-purpose word segmentation system (GPWS) for modern Chinese places particularly high practical demands on disambiguation, because it allows users to create dictionaries dynamically and allows multiple user dictionaries to take part in segmentation at the same time. Based on a large-scale corpus of real text, this paper examines the distribution and characteristics of ambiguous fragments, especially overlapping ambiguities; proposes an improved forward maximum matching algorithm for detecting ambiguous fragments; and, in line with the requirements of GPWS, proposes a practical "rules + exceptions" disambiguation strategy. Overlapping ambiguity fragments (about 2.4 million occurrences) were exhaustively extracted from a 100-million-character People's Daily corpus (approximately 234 MB), and open tests of the strategy on randomly sampled fragments achieved an average accuracy of 99%.
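The abstract does not give the details of the improved detection algorithm, so the following is only a minimal sketch of the plain baseline idea it builds on: segment with forward maximum matching, then flag every place where a lexicon word starts inside one matched word and runs past its right boundary. The function names, the `max_len` parameter, and the toy lexicon are assumptions made for illustration; they are not taken from the paper or from GPWS.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation; returns (start, end) word spans."""
    spans, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            if size == 1 or text[i:i + size] in lexicon:
                spans.append((i, i + size))
                i += size
                break
    return spans


def overlapping_fragments(text, lexicon, max_len=4):
    """Find overlapping-ambiguity fragments relative to the FMM segmentation.

    A fragment is reported when some lexicon word begins strictly inside an
    FMM word and ends beyond it. Overlapping hits are merged into one chain.
    """
    hits = []
    for start, end in forward_max_match(text, lexicon, max_len):
        for k in range(start + 1, end):  # positions strictly inside the word
            for e in range(end + 1, min(k + max_len, len(text)) + 1):
                if text[k:e] in lexicon:
                    hits.append((start, e))

    # Merge spans that share characters into maximal fragments.
    merged = []
    for s, e in sorted(hits):
        if merged and s < merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return [(s, e, text[s:e]) for s, e in merged]


if __name__ == "__main__":
    toy_lexicon = {"结合", "合成", "成分", "分子", "研究", "研究生", "生命", "起源"}
    for sentence in ("结合成分子时", "研究生命起源"):
        print(sentence, overlapping_fragments(sentence, toy_lexicon))
```

A detector of this kind only locates candidate fragments; deciding which segmentation is correct is the job of a separate step such as the "rules + exceptions" strategy described in the abstract, which this sketch does not attempt to reproduce.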
Source
《计算机研究与发展》
EI
CSCD
PKU Core Journals (北大核心)
2006, No. 6, pp. 1122-1128 (7 pages)
Journal of Computer Research and Development
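Funding
National Natural Science Foundation of China (60272055)
National High Technology Research and Development Program of China (863 Program) (2001AA114111)
Key Science and Technology Research Project of the Ministry of Education (00128)
Major Project of the Key Research Bases of Humanities and Social Sciences, Ministry of Education (02JAZJD740007)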
Keywords
Chinese information processing
general-purpose word segmentation system
disambiguation