Abstract
A topical search engine, also known as a vertical search engine, serves the needs of users in a specific domain. Heritrix is an open-source web crawler, but its WebUI-based startup is not convenient for general users. This paper departs from the usual way of using Heritrix: it abandons the WebUI startup, modifies the Heritrix source code, and integrates Lucene into Heritrix to build a complete search engine. A listener monitors the state of the search engine so that it can crawl and update its data automatically. In addition, the paper adds a web page filtering module and improves the ranking algorithm for query results, which improves the search engine's ease of use and its query accuracy.
Source
China Science and Technology Information (《中国科技信息》)
2012, No. 10, pp. 95-96 (2 pages)