Abstract
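For small lexicons, this paper proposes a disambiguation method based on a rule-statistic model and a word-weight algorithm for recognizing out-of-vocabulary words. Characters that occur frequently in ambiguous strings are learned from a large corpus and used as ambiguity markers. The rule-statistic model classifies and processes the context of each marker, the remaining text is segmented by forward or backward dynamic maximum matching, and the word-weight algorithm decides whether a run of consecutive single characters forms an out-of-vocabulary multi-character word. In experimental tests, the system achieved a precision of 98.88% and a recall of 98.32%.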
Word segmentation is a prerequisite for Chinese text understanding. This paper presents a novel segmentation method for a small vocabulary. It uses a rule-statistic model to eliminate ambiguity and a word-weight algorithm to recognize unknown words. Characters that occur frequently in ambiguous strings are extracted first, and the context of each extracted character is then handled according to the rule-statistic model. The remaining text is segmented by a dynamic maximum matching approach. Unknown words are identified from sequences of consecutive single-character words using the word-weight algorithm. Finally, the paper demonstrates the segmentation results with a software prototype that the authors developed from the proposed approach, which reaches a precision of 98.88% and a recall of 98.32%. These results indicate that the approach is effective and robust.
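The maximum matching step mentioned in the abstract can be illustrated with a minimal sketch. This is plain greedy forward and backward maximum matching, not the paper's "dynamic" variant, whose details the abstract does not give. The rule-statistic disambiguation and the word-weight algorithm are not reproduced here. The function name `max_match`, the `max_len` parameter, and the toy lexicon are assumptions made for illustration only.

```python
def max_match(text, lexicon, max_len=4, reverse=False):
    """Greedy maximum matching over `text` (illustrative sketch only).

    At each position the longest candidate found in `lexicon` is taken;
    if none matches, a single character is emitted as its own token.
    `reverse=True` scans from the end of the text (backward matching).
    """
    tokens = []
    if not reverse:
        i = 0
        while i < len(text):
            # Try the longest window first, shrinking down to one character.
            for size in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in lexicon:
                    tokens.append(piece)
                    i += size
                    break
    else:
        j = len(text)
        while j > 0:
            for size in range(min(max_len, j), 0, -1):
                piece = text[j - size:j]
                if size == 1 or piece in lexicon:
                    tokens.append(piece)
                    j -= size
                    break
        tokens.reverse()  # Tokens were collected right to left.
    return tokens


# Hypothetical toy lexicon; a real system would load a dictionary file.
lexicon = {"研究", "研究生", "生命", "起源"}
print(max_match("研究生命起源", lexicon))                # forward scan
print(max_match("研究生命起源", lexicon, reverse=True))  # backward scan
```

On this toy input the two scan directions disagree: the forward scan leaves 命 stranded as a single character, while the backward scan does not. Such stranded single characters are the kind of residue the abstract says is passed on for unknown-word detection, and the disagreement itself is the kind of ambiguity the abstract says is handled by the rule-statistic model.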
Source
《计算机工程与应用》
CSCD
PKU Core Journal (北大核心)
2004, No. 15, pp. 43-45, 91 (4 pages in total)
Computer Engineering and Applications
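Funding
National 973 Basic Research Program of China (Grant No. 2002CB312103)
National Natural Science Foundation of China (Grant No. 60373056)
Key Program of the National Natural Science Foundation of China (Grant No. 60033020)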