Abstract
Because ChineseAnalyzer and CJKAnalyzer, the two Chinese analyzers bundled with Lucene, cannot meet the needs of full-text retrieval applications, this paper presents a new Chinese word segmentation algorithm to improve Lucene's Chinese analysis. The algorithm is based on string matching and implements a maximum matching segmentation method that grows candidate words character by character and combines forward and reverse scans. Simulation experiments compare the improved analyzer with Lucene's two built-in analyzers in segmentation quality and efficiency. The results show that the improved analyzer segments text markedly better than the two built-in analyzers, improves the Chinese-processing capability of the full-text retrieval system, and brings the system's recall and precision up to users' requirements.
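The abstract names the technique but gives no details, so the following is only a minimal Python sketch of dictionary-based maximum matching that grows candidates one character at a time and runs in both directions. The function names, the toy lexicon, and the rule for choosing between the two results (fewer words, then fewer single-character words, then the reverse result) are assumptions for illustration, not the paper's actual implementation or any Lucene API.

```python
def forward_segment(text, lexicon, max_len):
    """Scan left to right; grow the candidate one character at a time
    and keep the longest candidate found in the lexicon."""
    words, i = [], 0
    while i < len(text):
        best = 1  # fall back to a single character when nothing matches
        for size in range(2, min(max_len, len(text) - i) + 1):
            if text[i:i + size] in lexicon:
                best = size
        words.append(text[i:i + best])
        i += best
    return words


def backward_segment(text, lexicon, max_len):
    """Scan right to left with the same grow-and-keep-longest rule."""
    words, j = [], len(text)
    while j > 0:
        best = 1
        for size in range(2, min(max_len, j) + 1):
            if text[j - size:j] in lexicon:
                best = size
        words.append(text[j - best:j])
        j -= best
    words.reverse()  # words were collected from the end of the text
    return words


def count_singles(words):
    """Number of single-character tokens in a segmentation."""
    return sum(1 for w in words if len(w) == 1)


def bidirectional_segment(text, lexicon):
    """Run both scans and pick one result.
    Assumed heuristic: fewer words wins, then fewer single-character
    words, and the reverse result wins any remaining tie."""
    max_len = max((len(w) for w in lexicon), default=1)
    fwd = forward_segment(text, lexicon, max_len)
    bwd = backward_segment(text, lexicon, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    return fwd if count_singles(fwd) < count_singles(bwd) else bwd


# Hypothetical toy lexicon; a real analyzer would load a full dictionary.
LEXICON = {"研究", "研究生", "生命", "起源"}
print(bidirectional_segment("研究生命的起源", LEXICON))
```

With this toy lexicon the forward scan alone greedily takes 研究生 and leaves 命 stranded as a single character, while the reverse scan yields 研究 / 生命 / 的 / 起源. The tie-break on single-character tokens then selects the reverse result, which is the kind of ambiguity a combined forward and reverse approach is meant to resolve.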
Source
《青岛大学学报(自然科学版)》
CAS
2011, No. 3, pp. 53-58 (6 pages)
Journal of Qingdao University (Natural Science Edition)
Funding
National Key Technology R&D Program of China (2006BA111B07)