Abstract
With the rapid development of the Internet and society, new words are constantly emerging, and identifying and cataloguing them is an important research topic in Chinese information processing. This paper proposes a new-word recognition method that draws its candidate strings from repeated strings extracted with a PAT-Array, which improves the recall of new words. On this basis, the internal patterns of new words are analyzed and a garbage-string filtering mechanism is added. Single-character strings are filtered mainly with a garbage dictionary, while new words that follow multi-character patterns are confirmed by combining an improved mutual information measure with the independent word probability. As a result, the accuracy of new-word recognition is greatly improved.
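The two stages named in the abstract, repeated-string extraction and mutual-information scoring, can be illustrated with a minimal sketch. It is not the paper's implementation: the PAT-Array is approximated by a naively sorted list of suffix positions, the score is standard pointwise mutual information rather than the paper's improved variant, and the garbage dictionary and independent word probability are omitted. The names `repeated_strings` and `min_pmi`, their parameters, and the sample text are assumptions made for illustration.

```python
import math
from itertools import groupby


def repeated_strings(text, min_len=2, max_len=4, min_freq=2):
    """Return {substring: frequency} for substrings that repeat in text.

    Sorting suffix start positions places suffixes with a common prefix
    next to each other, which is the idea behind a PAT-Array (suffix array).
    This naive sort is for illustration only and is not efficient.
    """
    order = sorted(range(len(text)), key=lambda i: text[i:i + max_len])
    found = {}
    for n in range(min_len, max_len + 1):
        # Prefixes of length n, read in suffix order; equal ones are adjacent.
        prefixes = (text[i:i + n] for i in order if i + n <= len(text))
        for prefix, run in groupby(prefixes):
            freq = sum(1 for _ in run)
            if freq >= min_freq:
                found[prefix] = freq
    return found


def min_pmi(candidate, text):
    """Lowest pointwise mutual information over all two-way splits.

    Assumes candidate has at least two characters and occurs in text.
    Probabilities are rough estimates: occurrence count / text length.
    """
    total = len(text)

    def prob(s):
        hits = sum(1 for i in range(len(text) - len(s) + 1)
                   if text.startswith(s, i))
        return hits / total

    p_whole = prob(candidate)
    return min(
        math.log2(p_whole / (prob(candidate[:k]) * prob(candidate[k:])))
        for k in range(1, len(candidate))
    )


# Hypothetical sample text, not taken from the paper's corpus.
sample = "微博很火微博很新大家都用微博"
candidates = repeated_strings(sample, min_len=2, max_len=3)
scores = {c: min_pmi(c, sample) for c in candidates}
```

On a text this small, a true word such as 微博 and a fragment such as 博很 can receive the same mutual-information score, which hints at why the abstract pairs mutual information with a second signal and a garbage-string filter.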
Source
《河北省科学院学报》
CAS
2014, No. 2, pp. 35-40 (6 pages)
Journal of The Hebei Academy of Sciences