Abstract
Large-scale word-segmented Chinese corpora are an important resource for research in both linguistics and computational linguistics, and consistency is one of the main criteria for judging the quality of such a corpus. This paper lists the major structural types that tend to produce segmentation inconsistencies, distinguishes the "grammatical word" (or "linguistic word") from the "psychological word", and argues that a segmented corpus is best segmented into psychological words. Because the psychological word is inherently fuzzy, full consistency in the strict sense is unattainable for a segmented corpus; the goal should instead be consistency under controlled conditions.
Source
《语言文字应用》
CSSCI
PKU Core Journals (北大核心)
1999, No. 2, pp. 90-93 (4 pages)
Applied Linguistics
Funding
National Natural Science Foundation of China