Abstract
A model of a Chinese full-text search system based on Lucene is proposed. Building on an analysis of Lucene's architecture, the system adopts a statistics-based technique for extracting the main text of web pages and adds a Chinese word segmentation module and an index-document preprocessing module to improve retrieval efficiency and precision. Search results are processed with document clustering so that they are displayed by category, which helps users find the information they need more quickly. Experimental results show that, when searching Chinese web pages, the proposed system performs markedly better in efficiency, precision, and result processing.
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
PKU Core Journal (北大核心)
2008, Issue 19, pp. 5083-5086 (4 pages)
Keywords
full-text search
web page text extraction
Chinese word segmentation
index-document preprocessing
document clustering