Abstract
To improve the topical focus and accuracy of search engines, the paper builds a vertical search engine for new drug research and development on top of Nutch, an open-source search engine. It describes the system architecture, workflow, and key technologies in detail, including URL seed selection, deep web crawling, and topic relevance determination.
Source
《医学信息学杂志》
CAS
2013, Issue 10, pp. 38-42, 66 (6 pages in total)
Journal of Medical Informatics
Keywords
Nutch
Deep web crawling
URL seeds
New drug research and development
Vertical search engine