一种基于生语料的领域词典生成方法 (cited by: 11)

Method of Special Domain Lexicon Construction Based on Raw Materials
Abstract: To segment text accurately, every practical Chinese information processing system needs its own domain-specific lexicon. To address the shortcomings of existing lexicon construction methods, this paper proposes a method for constructing a domain lexicon: a general-purpose lexicon is used to segment a raw domain corpus, a maximum matching algorithm based on segmentation units is proposed to obtain a set of candidate word strings, and rules are then applied to refine this set, yielding the final domain lexicon. Generation is largely automatic, requires little manual intervention, and makes the lexicon easy to update. The lexicon generated by this method has been applied in our self-developed "Web-based intelligent question answering system" (基于Web的智能答疑系统), with good results.

A special domain lexicon is vital to any practical Chinese information processing system, especially for Chinese word segmentation. To address the limitations of current methods for constructing special domain lexicons, this paper proposes a novel Chinese lexicon construction approach for word segmentation. Starting from a large amount of raw material collected in advance for one special domain, the approach segments each raw text with an open domain lexicon and extracts the longest repeated string patterns. Non-meaningful words are then trimmed from the candidate word set to improve extraction accuracy, optimization rules filter out further non-meaningful words, and the special domain lexicon is finally constructed. The proposed method has been implemented and applied in our Web answering system. The experimental results show that it is practical, effective, and extensible.
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), CSCD, PKU Core, 2005, No. 6, pp. 1088-1092 (5 pages)
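Funding: Supported by the National Natural Science Foundation of China (60373105), the National "Tenth Five-Year Plan" Major Science and Technology Research Project (2001BA101A01), and the Ministry of Education Excellent Young Teachers Fund Program.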
Keywords: special domain lexicon (领域词典); open domain lexicon (通用词典); word frequency statistics (词频统计); maximum matching (最大匹配)
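The pipeline summarized in the abstract (segment a raw domain corpus with a general-purpose lexicon, collect repeated runs of adjacent segmentation units as candidate word strings, then filter them with rules) can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual algorithm or rules, so everything here is an assumption made for illustration: the function names `fmm_segment`, `candidate_strings`, and `filter_candidates`, the parameters `max_len`, `max_units`, and `min_freq`, the toy lexicon, and the two filtering rules (drop candidates that start or end with a function word; drop a candidate contained in a longer one with the same frequency).

```python
from collections import Counter


def fmm_segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest
    lexicon word, falling back to a single character."""
    units, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                units.append(piece)
                i += size
                break
    return units


def candidate_strings(units, max_units=4, min_freq=2):
    """Count runs of 2..max_units adjacent segmentation units and keep
    those that occur at least min_freq times (keys are tuples of units)."""
    counts = Counter()
    for n in range(2, max_units + 1):
        for i in range(len(units) - n + 1):
            counts[tuple(units[i:i + n])] += 1
    return {run: c for run, c in counts.items() if c >= min_freq}


def filter_candidates(cands, stop_units):
    """Hypothetical optimization rules, not the paper's:
    1. drop runs that begin or end with a function word;
    2. drop a string contained in a longer string of equal frequency."""
    kept = {run: c for run, c in cands.items()
            if run[0] not in stop_units and run[-1] not in stop_units}
    joined = {"".join(run): c for run, c in kept.items()}
    return {s: c for s, c in joined.items()
            if not any(s != t and s in t and c == d
                       for t, d in joined.items())}


if __name__ == "__main__":
    general_lexicon = {"遍历", "存储", "的", "和"}   # toy general lexicon
    raw_text = "二叉树的遍历和二叉树的存储"            # toy domain corpus
    units = fmm_segment(raw_text, general_lexicon)
    domain_words = filter_candidates(candidate_strings(units), {"的", "和"})
    print(domain_words)  # {'二叉树': 2}
```

In this toy run the general lexicon lacks the domain term 二叉树, so segmentation splits it into single characters; the repeated run of those units is then recovered as a candidate and survives both filters. A real corpus would need a much larger lexicon and more careful rules.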