Abstract
As information retrieval needs become increasingly specialized and personalized, the shortcomings of general-purpose search engines have become apparent. Vertical search, a major direction in search engine development, is attracting growing attention. This paper presents the overall architecture of a vertical search engine and analyzes in detail the key technologies involved, including web crawling, Chinese word segmentation, and text classification. The segmentation and classification algorithms are integrated into Nutch to implement a system prototype. Experiments show that the system achieves a topic relevance of over 94%.
Source
Information Science (《情报科学》)
CSSCI
PKU Core Journal
2011, Issue 3, pp. 421-424, 439 (5 pages)
Funding
Zhangjiakou City 2009 Key Science and Technology Project (0921047B)
Keywords
vertical search engine
Chinese word segmentation
text classification
topic relevance
Nutch