Research on Unlisted Words Identification in Chinese Self-adaptive Segmentation (自适应分词算法中的未登录词识别技术研究)

Cited by: 5
Abstract: This paper studies unlisted (out-of-vocabulary) word identification in depth and proposes a new identification algorithm built from several rules: a numeral-and-quantifier identification rule, a boundary single-character rule, a function-word auxiliary rule, a memory-based unlisted word identification rule, and a rule that selects unlisted words by left and right probing. With these rules the algorithm can effectively identify many types of unlisted words across multiple domains without relying on a large corpus. By comparing the results of bi-directional segmentation, the algorithm also identifies the great majority of crossing ambiguities, which resolves the new segmentation ambiguities that unlisted word identification would otherwise introduce. In an open test on current web articles, segmentation accuracy was about 90.1%, and unlisted word identification reached a precision of 91.2% and a recall of 94.7%.
Authors: 程冲 (Cheng Chong), 黄水清 (Huang Shuiqing)
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), indexed in CSSCI and the Peking University Core Journals list; 2009, No. 4, pp. 530-536 (7 pages)
Keywords: Chinese segmentation; unlisted words identification; crossing ambiguity; Chinese segmentation system
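
The abstract mentions detecting crossing ambiguities by comparing the results of bi-directional segmentation. The paper's own rules are not given in this record, so the following is only a minimal sketch of the generic idea, assuming plain forward and backward maximum matching over a toy lexicon. The function names, the `max_len` parameter, and the sample lexicon are all hypothetical and are not taken from the paper.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right longest-match segmentation."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left longest-match segmentation."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def cut_points(tokens):
    """Return the set of character offsets at which the tokens are split."""
    pos, cuts = 0, {0}
    for token in tokens:
        pos += len(token)
        cuts.add(pos)
    return cuts


def disagreement_spans(text, lexicon, max_len=4):
    """Return substrings on which the two segmentations disagree.

    Such spans are candidates for crossing ambiguity (an assumption of
    this sketch, not the paper's exact criterion).
    """
    fwd = cut_points(forward_max_match(text, lexicon, max_len))
    bwd = cut_points(backward_max_match(text, lexicon, max_len))
    shared = sorted(fwd & bwd)
    spans = []
    for start, end in zip(shared, shared[1:]):
        # Any cut strictly inside a shared interval belongs to one side only.
        if any(start < c < end for c in fwd | bwd):
            spans.append(text[start:end])
    return spans


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", LEXICON))
print(backward_max_match("研究生命起源", LEXICON))
print(disagreement_spans("研究生命起源", LEXICON))
```

Here the forward pass groups 研究生 while the backward pass groups 生命, so the span 研究生命 is flagged for further disambiguation. A real system would then need additional rules or statistics to choose between the two readings.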