Abstract
Building on the PAT array, this paper introduces an LCP array that records the common prefix lengths of the text suffixes, and extracts high-frequency words from Chinese text quickly by scanning the LCP array. The algorithm does not depend on a segmentation dictionary: it finds high-frequency words by detecting repeated strings, can extract any repeated string, and is particularly effective for new words and combined words. Experimental results show that the extracted high-frequency words reach a high acceptance rate, agree to a high degree with the keywords extracted by the ICTCLAS system, and that the algorithm has an advantage over ICTCLAS in discovering combined words.
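The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the PAT array can be treated as a plain suffix array, builds it by naive sorting, and uses hypothetical function names (`build_suffix_array`, `build_lcp`, `repeated_strings`) and thresholds (`min_len`, `min_count`) chosen only for illustration. Occurrences are counted with overlap, and nested substrings are not filtered out.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    # Assumption: stands in for the PAT array described in the abstract.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp(text, sa):
    # lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k].
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def repeated_strings(text, min_len=2, min_count=2):
    # Scan the LCP array: a run of r consecutive entries >= length means
    # r + 1 suffixes share the same prefix of that length.
    sa = build_suffix_array(text)
    lcp = build_lcp(text, sa)
    counts = {}
    for length in range(min_len, max(lcp, default=0) + 1):
        k = 1
        while k < len(sa):
            if lcp[k] < length:
                k += 1
                continue
            start = k
            while k < len(sa) and lcp[k] >= length:
                k += 1
            count = k - start + 1
            if count >= min_count:
                counts[text[sa[start]:sa[start] + length]] = count
    return counts
```

For example, `repeated_strings("abcabcab")` reports "ab" with a count of 3, because no dictionary is consulted and every repeated string of at least `min_len` characters is a candidate. A practical extractor would add a faster suffix array construction and filtering of candidates that are fragments of longer repeats.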
Source
New Technology of Library and Information Service (《现代图书情报技术》)
CSSCI
Peking University Core Journals (北大核心)
2012, No. 6, pp. 50-53 (4 pages)
Keywords
Chinese information processing
High-frequency word extraction
PAT array
Chinese word segmentation
Keyword detection