Abstract
Crossing ambiguities are a major factor limiting the accuracy of automatic Chinese word segmentation systems, and resolving them remains an open issue. This paper introduces the concept of the maximal crossing ambiguity and divides it into two major types: true and pseudo. Examining a Chinese corpus of about 100 million characters, we find that high-frequency maximal crossing ambiguities show strong coverage and stability: the top 4,619 account for 59.20% of occurrences, and this coverage changes little across domains. Of these, 4,279 are of the pseudo type, with a coverage of 53.35%. Based on this analysis, we propose a memory-based strategy for handling high-frequency maximal crossing ambiguities, which can effectively improve the accuracy of practical, unrestricted Chinese word segmentation systems.
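The memory-based strategy mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each high-frequency pseudo ambiguity has a single valid segmentation that can be stored in a lookup table and applied directly, with all remaining text passed to an ordinary segmenter. The table entries, the function name `segment_with_memory`, and the `fallback` parameter are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List

# Hypothetical memory table: each key is a pseudo crossing ambiguity, each
# value is its stored segmentation. The entries are illustrative assumptions.
PSEUDO_AMBIGUITY_MEMORY: Dict[str, List[str]] = {
    "为人民": ["为", "人民"],
    "和软件": ["和", "软件"],
}


def segment_with_memory(
    text: str,
    memory: Dict[str, List[str]],
    fallback: Callable[[str], List[str]],
) -> List[str]:
    """Segment text, resolving memorized ambiguities by table lookup.

    Scans left to right and prefers the longest memorized string at each
    position. Stretches not covered by the memory are handed to `fallback`,
    which stands in for any general-purpose segmenter.
    """
    max_len = max((len(key) for key in memory), default=0)
    words: List[str] = []
    pending: List[str] = []  # characters not covered by the memory
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in memory:
                match = candidate
                break
        if match is None:
            pending.append(text[i])
            i += 1
            continue
        if pending:
            words.extend(fallback("".join(pending)))
            pending = []
        words.extend(memory[match])
        i += len(match)
    if pending:
        words.extend(fallback("".join(pending)))
    return words


# Usage: `list` splits into single characters and is only a placeholder
# for a real segmenter.
result = segment_with_memory("他为人民服务", PSEUDO_AMBIGUITY_MEMORY, list)
```

A plain dictionary is used here because the abstract reports that a few thousand entries already cover more than half of all occurrences, so the table stays small. True ambiguities, whose segmentation depends on context, would need a separate treatment that this sketch does not attempt.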
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journals (北大核心)
1999, No. 1, pp. 27-34 (8 pages)
Funding
National Natural Science Foundation of China
Keywords
Chinese information processing
Chinese word segmentation
ambiguous segmentation fields
high-frequency maximal crossing ambiguities
memory-based disambiguation strategy