Abstract
Building on the Lucene full-text retrieval toolkit, this paper analyzes the current mainstream Chinese word segmentation algorithms and the Lucene relevance ranking algorithm, and proposes an improved segmentation algorithm and an improved relevance ranking algorithm. Using inverted indexing, retrieval techniques, distributed storage, and parallel computing, it also analyzes and designs a search engine for massive digital works information, providing users with a fast and accurate search service. The experiments compare segmentation speed and segmentation quality, as well as the response time, hit count, precision, and recall of keyword searches. The experimental results show that the system substantially improves search speed while preserving the accuracy of the search results.
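To make two of the terms in the abstract concrete, the following is a minimal Python sketch of an inverted index and of the precision and recall measures used in the evaluation. It is not the paper's implementation and does not use Lucene: the function names (`build_inverted_index`, `search`, `precision_recall`), the toy documents, and the relevance judgments are all assumptions chosen for illustration, and the tokens are assumed to be already segmented.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it.

    `docs` maps a document id to a list of already-segmented tokens
    (the segmentation step itself is outside this sketch).
    """
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def search(index, keyword):
    """Return the ids of all documents containing `keyword`."""
    return set(index.get(keyword, set()))


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy corpus and relevance judgments, for illustration only.
docs = {
    1: ["digital", "works", "search"],
    2: ["search", "engine", "index"],
    3: ["digital", "library"],
}
index = build_inverted_index(docs)
retrieved = search(index, "digital")   # documents 1 and 3
relevant = {1, 2}                      # assumed ground truth for the query
precision, recall = precision_recall(retrieved, relevant)
```

With these assumed judgments, one of the two retrieved documents is relevant and one of the two relevant documents is retrieved, so both precision and recall are 0.5.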
Source
Computer Engineering & Science (《计算机工程与科学》)
CSCD
Peking University Core Journal (北大核心)
2013, Issue 5, pp. 166-172 (7 pages)
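Funding
National Key Technology R&D Program of the Ministry of Science and Technology of China (2012BAH04f03)
Research Base: Scientific Research Innovation Platform funded project (PXM2013_014212_000011)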