Abstract
Traditional Chinese word segmentation focuses on identifying word boundaries and ignores the fact that the boundary between words and phrases in Chinese is unclear. In theory, linguists often disagree about where word boundaries lie, so the segmentation standards of different corpora cannot be unified; in practice, corpora built under differing guidelines cannot fully meet the needs of specific applications. This paper presents a method based on cascaded CRF models for automatically analyzing the internal structure of words, determining word boundaries and internal structures simultaneously with high precision. Compared with traditional word segmentation, word structure analysis is more consistent with the fuzzy boundary between Chinese lexical and syntactic units; it addresses the inconsistency of corpus standards and accommodates differing application requirements.
Source
《中文信息学报》
CSCD
PKU Core Journals
2015, No. 4, pp. 1-7, 24 (8 pages in total)
Journal of Chinese Information Processing
Funding
Natural Science Foundation, Young Scientists Project (61202162)
Ministry of Education Doctoral Program Fund, New Teachers Category (20123201120011)
Keywords
Chinese word segmentation
internal structure
annotation standard
cascaded CRFs