Abstract
New word recognition is an important factor in the precision and speed of search. This paper proposes a method for automatically recognizing new Chinese words based on a statistical model and word collocation. Conditional probability is used to extract single-character-word collocation features and boundary-word features, and a layered structure is used to locate and recognize new words. The text is first coarsely segmented by combining bidirectional maximum matching. Candidate new-word positions are then obtained from single-character-word collocations, and the boundaries of each candidate are determined with boundary words. An improved Nagao string-frequency statistics method counts repeated strings for the candidates within the text, and candidates that occur only once in the text are confirmed with the help of a search engine. Tests on recent online articles from Sina show that a system built on this method can recognize new words from different domains and performs well on low-frequency words, longer words, and new word collocations. For locating new words by single-character-word collocation checking, the F-measure reaches 96.8%.
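The first stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy lexicon, the function names, and the `max_len` window are assumptions made for illustration, "bidirectional" is read here as running forward and backward maximum matching, and the repeated-string step uses plain n-gram counting as a stand-in for the improved Nagao method, whose details the abstract does not give.

```python
from collections import Counter

def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation; unknown characters become single-character tokens."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left segmentation, returned in reading order."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]

def single_char_runs(words):
    """Join runs of two or more adjacent single-character tokens into candidate strings."""
    runs, current = [], []
    for w in list(words) + [""]:  # the empty sentinel flushes the final run
        if len(w) == 1:
            current.append(w)
        else:
            if len(current) >= 2:
                runs.append("".join(current))
            current = []
    return runs

def repeated_substrings(text, min_len=2, max_len=4, min_count=2):
    """Brute-force n-gram counting (an assumed stand-in, not Nagao's algorithm)."""
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_count}

# Hypothetical toy lexicon that does not contain the word "微博".
lexicon = {"我们", "喜欢", "搜索", "引擎"}
tokens = forward_max_match("我们喜欢微博搜索引擎", lexicon)
candidates = single_char_runs(tokens)
```

In this toy run the out-of-lexicon string is left as two adjacent single-character tokens, which is the kind of position the abstract says is flagged before boundary words and frequency statistics are applied. The conditional-probability collocation scoring and the search-engine check are not sketched, since the abstract gives no formulas or interfaces for them.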
Source
Journal of Wuhan University (Natural Science Edition)
CAS
CSCD
PKU Core Journal
2010, Issue 6, pp. 704-710 (7 pages)
Keywords
new word recognition
one-character word
boundary word
collocation extraction
search engine