Abstract
This paper proposes a model for resolving word-segmentation ambiguity in Chinese based on the pauses marked by function words ("empty words"). The model first performs rough segmentation using a function-word knowledge base, dividing the text into pause-delimited phrases. It then segments the phrases between pauses with bidirectional maximum matching (forward MM and reverse RMM) and extracts the ambiguous fragments. Finally, it resolves the ambiguities with an N-gram model combined with data-smoothing techniques. The overall process thus comprises three stages: rough segmentation, fine segmentation, and disambiguation. Test results show that the model effectively reduces the rate of erroneous segmentation caused by word ambiguity.
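The two middle stages of the pipeline — bidirectional maximum matching to surface ambiguous phrases, then N-gram scoring with smoothing to pick a segmentation — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy dictionary, sentence, maximum word length, and add-one (Laplace) smoothing are all assumptions standing in for the authors' actual knowledge base, corpus counts, and smoothing method.

```python
import math
from collections import Counter

# Toy dictionary and maximum word length -- illustrative assumptions only.
DICT = {"研究", "生命", "研究生", "命", "起源", "的"}
MAX_LEN = 3

def forward_mm(text, dictionary, max_len=MAX_LEN):
    """Forward maximum matching: greedily take the longest dictionary word
    starting at the current position; fall back to a single character."""
    result, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + j]
            if j == 1 or word in dictionary:
                result.append(word)
                i += j
                break
    return result

def reverse_mm(text, dictionary, max_len=MAX_LEN):
    """Reverse maximum matching: the same greedy idea, scanning from the end."""
    result, i = [], len(text)
    while i > 0:
        for j in range(min(max_len, i), 0, -1):
            word = text[i - j:i]
            if j == 1 or word in dictionary:
                result.append(word)
                i -= j
                break
    return list(reversed(result))

def find_ambiguity(text, dictionary):
    """If FMM and RMM disagree, the phrase is ambiguous and is handed to the
    N-gram scoring stage; otherwise the agreed segmentation is accepted."""
    fmm, rmm = forward_mm(text, dictionary), reverse_mm(text, dictionary)
    return fmm, rmm, fmm != rmm

def bigram_score(seg, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of a candidate segmentation;
    a stand-in for the paper's N-gram model plus data smoothing."""
    return sum(
        math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size))
        for w1, w2 in zip(seg, seg[1:])
    )
```

On the classic ambiguous phrase 研究生命起源, FMM yields 研究生/命/起源 while RMM yields 研究/生命/起源; the disagreement flags the phrase as ambiguous, and bigram counts from a training corpus then favor the RMM reading.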
Source
《图书情报工作》
CSSCI
Peking University Core Journal (北大核心)
2010, No. 14, pp. 121-125 (5 pages)
Library and Information Service
Funding
Scientific Research Project of the Guangxi Education Department, "Research on Intelligent Retrieval Technology Based on Chinese Natural Language Understanding" (project no. 桂科目0991254)
This paper is one of the research outcomes of the Guangxi Graduate Education Innovation Program project "Research on an Object-Oriented Chinese Semantic Network Model" (project no. 2008105960812M18)
Keywords
word segmentation; pause; maximum matching; N-Gram model; data smoothing