Abstract
Based on Lucene, currently the most popular full-text retrieval library, this article describes the design and implementation of a Chinese word segmentation module (tokenizer). The core of the module is a Chinese segmentation algorithm that combines dictionary-based string matching with a statistical model. The main goal of this research is to find a more effective way of handling Chinese words and thereby improve the Chinese-language processing capability of full-text retrieval systems. Experiments show that the module achieves high segmentation accuracy, and that, compared with other popular Chinese tokenizers, it performs well in certain domains (e-commerce); its segmentation speed, however, still leaves room for improvement. In future work we plan a number of refinements to the module, aiming at an efficient general-purpose Chinese tokenizer and, ultimately, an efficient Chinese full-text retrieval system.
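The abstract describes an algorithm that combines string matching with a statistical model but does not give its details. As a minimal sketch of the string-matching half only, the following illustrates forward maximum matching against a toy dictionary; the function name, dictionary contents, and maximum word length are illustrative assumptions, not taken from the paper.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary word that starts there; fall back to emitting
    a single character when no word matches."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # single-character fallback
        # Try the longest candidate first, down to length 2.
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens

# Toy lexicon; a real system would load a large dictionary file.
dictionary = {"全文", "检索", "全文检索", "系统", "中文", "分词"}
print(forward_max_match("中文全文检索系统", dictionary))
# → ['中文', '全文检索', '系统']
```

A statistics-based component, as mentioned in the abstract, would typically be layered on top of such matching to resolve ambiguous segmentations, e.g. by preferring the candidate segmentation with the higher corpus probability.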
Source
Electronic Technology (Shanghai)
2012, No. 9, pp. 54-56 (3 pages)