Abstract
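To segment text accurately, a practical Chinese information processing system needs a lexicon specific to its domain. To address the shortcomings of existing lexicon construction methods, this paper proposes a method for building a domain lexicon: raw domain corpora are segmented with a general-purpose lexicon, and a maximum matching algorithm based on segmentation units is proposed to obtain a set of candidate word strings, which is then refined with rules to produce the final domain lexicon. The lexicon is generated largely automatically, requires little manual intervention, and is easy to update. The resulting domain lexicon has been deployed in our independently developed "Web-based Intelligent Question Answering System", where it has performed well.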
A special domain lexicon is vital to any practical Chinese information processing system, especially for Chinese word segmentation. To address the limitations of current methods for constructing special domain lexicons, this paper proposes a novel Chinese lexicon construction approach for word segmentation. The approach starts from a large body of raw text collected in advance for a given domain. Each text is segmented with an open domain lexicon, and the longest repeated string patterns are extracted from the result. Non-meaningful strings are then trimmed from the candidate word set to improve extraction accuracy, optimization rules filter the remaining candidates further, and the special domain lexicon is constructed from what is left. The proposed method has been implemented and applied in our Web answering system. The experimental results show that it is practical, effective, and extensible.
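The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the segmentation-unit matching or the optimization rules, so the function names, the `max_len`, `max_units`, and `min_count` parameters, and the nested-string filtering rule below are all assumptions made for illustration. The sketch segments raw text by forward maximum matching against a general lexicon, counts repeated runs of adjacent segmentation units as candidate strings, and applies one hypothetical filtering rule. It also ignores sentence boundaries and punctuation for brevity.

```python
from collections import Counter


def forward_max_match(text, lexicon, max_len=4):
    """Segment text greedily, preferring the longest lexicon entry at each position."""
    units = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or piece in lexicon:
                units.append(piece)
                i += size
                break
    return units


def candidate_strings(units, max_units=3, min_count=2):
    """Count runs of 2..max_units adjacent units; keep those seen at least min_count times."""
    counts = Counter()
    for n in range(2, max_units + 1):
        for i in range(len(units) - n + 1):
            counts["".join(units[i:i + n])] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


def drop_nested(candidates):
    """Hypothetical rule: drop a candidate contained in a longer one with the same count."""
    return {
        s: c for s, c in candidates.items()
        if not any(s != t and s in t and c == candidates[t] for t in candidates)
    }


# Toy general lexicon and toy "domain" text, both invented for this example.
general_lexicon = {"数据", "结构", "算法"}
units = forward_max_match("数据结构和算法数据结构", general_lexicon)
domain_candidates = drop_nested(candidate_strings(units))
```

In this toy run the general lexicon splits the text into short units, and the repeated run "数据结构" surfaces as a candidate domain term. A real system would need a large corpus, sentence-aware counting, and the paper's actual rules.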
Source
《小型微型计算机系统》
CSCD
Peking University Core Journals list
2005, Issue 6, pp. 1088-1092 (5 pages)
Journal of Chinese Computer Systems
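Funding
National Natural Science Foundation of China (60373105)
National Key Science and Technology Project of the Tenth Five-Year Plan (2001BA101A01)
Excellent Young Teachers Fund of the Ministry of Education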
Keywords
special domain lexicon (领域词典)
open domain lexicon (通用词典)
word frequency statistics (词频统计)
maximum matching (最大匹配)