Abstract
The Chinese analyzers currently used with the Lucene search engine segment text in ways that do not match Chinese usage. To address this deficiency, the paper implements a new Chinese analyzer based on the forward maximum matching segmentation algorithm and a dictionary of basic standard Chinese words. Experimental results show that its segmentation better matches Chinese usage, while its segmentation and indexing performance remain very close to those of analyzers based on mechanical segmentation. In addition, retrieval speed improves by 2 to 4 times, and retrieval recall improves by 59%.
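The abstract names forward maximum matching but does not show it. The following is a minimal sketch of the general technique, not the paper's implementation and not Lucene's API: the function name `forward_max_match`, the `max_word_len` parameter, the single-character fallback, and the toy dictionary are all assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_word_len=None):
    """Greedily take the longest dictionary word starting at each position."""
    if max_word_len is None:
        # Assumption: the window is bounded by the longest dictionary entry.
        max_word_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Assumption: an unmatched single character is emitted as its own token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"搜索", "引擎", "搜索引擎", "中文", "分词"}
tokens = forward_max_match("中文搜索引擎分词", toy_dictionary)
print(tokens)
```

Because the longest candidate is tried first, "搜索引擎" is kept as one token here rather than split into "搜索" and "引擎", which is the kind of dictionary-driven behaviour the abstract contrasts with mechanical segmentation.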
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journal (北大核心)
2009, No. 12, pp. 157-159 (3 pages)
Funding
National Natural Science Foundation of China
Keywords
analyzer
index
retrieval
word segmentation
search engine