Abstract
A Lucene-based full-text retrieval system model is designed and implemented. For Chinese word segmentation, the model includes a dictionary-based segmentation module that uses the forward maximum matching algorithm. For documents in multiple formats, parsing is implemented through a common interface with dynamic instantiation, so the system can effectively process common formats such as txt, xml, html, pdf, doc and rtf.
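The abstract does not give the segmentation module's code, so the following is only a minimal sketch of dictionary-based forward maximum matching in general, not the authors' implementation. The class name `ForwardMaxMatchSegmenter`, its constructor, and the `segment` method are hypothetical names chosen for illustration. The sketch assumes the dictionary fits in an in-memory set and that word length can be measured in Java `char` units, which holds for common Chinese characters.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative dictionary-based forward maximum matching segmenter (hypothetical class). */
final class ForwardMaxMatchSegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength; // length of the longest dictionary entry, in chars

    ForwardMaxMatchSegmenter(Set<String> dictionary) {
        this.dictionary = new HashSet<>(dictionary);
        int longest = 1;
        for (String word : dictionary) {
            longest = Math.max(longest, word.length());
        }
        this.maxWordLength = longest;
    }

    /** Scans left to right, always taking the longest dictionary word at the current position. */
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Start with the longest possible candidate beginning at pos.
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate from the right until it is a dictionary word
            // or only a single character is left.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            // A single character with no dictionary match is emitted as its own token.
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }
}
```

With a dictionary containing 全文, 检索, 全文检索 and 系统, calling `segment("全文检索系统")` on this sketch yields 全文检索 followed by 系统, because the longest match is preferred at each position. The same greedy rule is a known limitation of the technique: with 研究, 研究生, 生命 and 起源 in the dictionary, 研究生命起源 is split as 研究生 / 命 / 起源.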
Source
《暨南大学学报(自然科学与医学版)》
CAS
CSCD
PKU Core Journals (北大核心)
2009, Issue 5, pp. 504-508 (5 pages)
Journal of Jinan University (Natural Science & Medicine Edition)
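Funding
Joint Key Project of the National Natural Science Foundation of China and the Guangdong Provincial Science Foundation (U0775001)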
Keywords
full-text retrieval
Chinese word segmentation
formatted documents