Abstract
Information on the Internet is growing rapidly, and people rely mainly on search engines to find it. As specialization increases, vertical search engines have become a new tool, but building a specialized search engine is a relatively complex process. To address the inflexible configuration of focused crawlers in vertical search engines, a rules engine is integrated into the crawler so that a rule base controls how the crawler runs. Using the highly extensible open-source crawler project Heritrix and the open-source rules engine project Drools, a personalized crawler that is easy to configure and highly flexible is built. This changes the configuration of the focused crawler from tightly coupled to loosely coupled and makes the crawler easier for users to configure.
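The idea of letting a rule base, rather than crawler code, decide what gets crawled can be illustrated with a minimal sketch. It does not use the Heritrix or Drools APIs. The `Rule` class, `load_rules`, `should_crawl`, `filter_frontier`, and the host `example.org` are all hypothetical names chosen for illustration, and the first-match-wins policy is an assumption, not something the abstract specifies.

```python
from dataclasses import dataclass
from typing import Callable, List
from urllib.parse import urlparse


@dataclass
class Rule:
    """A hypothetical crawl rule: a named condition and the decision it yields."""
    name: str
    condition: Callable[[str], bool]
    accept: bool


def load_rules() -> List[Rule]:
    # Stand-in for a rule base. In a real system the rules would live outside
    # the crawler code (for example in rule files) so they can change without
    # touching the crawler itself.
    return [
        Rule("reject-binary", lambda url: url.lower().endswith((".jpg", ".zip")), False),
        Rule("accept-topic-host", lambda url: urlparse(url).hostname == "example.org", True),
    ]


def should_crawl(url: str, rules: List[Rule], default: bool = False) -> bool:
    """Return the decision of the first rule whose condition matches (assumed policy)."""
    for rule in rules:
        if rule.condition(url):
            return rule.accept
    return default


def filter_frontier(urls: List[str], rules: List[Rule]) -> List[str]:
    """Keep only the candidate URLs that the rule base accepts."""
    return [u for u in urls if should_crawl(u, rules)]
```

Because the crawler only calls `should_crawl`, swapping the list returned by `load_rules` changes crawl behavior without changing crawler code, which is the loose coupling the abstract describes.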
Source
Computer Technology and Development (《计算机技术与发展》)
2011, No. 3, pp. 56-59, 63 (5 pages)
Funding
Electronic Development Fund Project of the Ministry of Information Industry (信部运[2006]634号)
Keywords
rules engine
focused crawler
search engine