Abstract
A vertical search engine based on a similarity clustering algorithm is designed and implemented. The web crawler NWebCrawler, configured with custom regular expressions, efficiently crawls the required URLs. Structured data are extracted by parsing the information at the crawled URLs. Search keywords are segmented with the forward maximum matching algorithm, and search results are clustered by their similarity values under the vector space model (VSM). Finally, an index is built with Lucene to retrieve the required information. Experimental results show that the vertical search engine based on the similarity clustering algorithm achieves higher precision and recall than general-purpose search engines and, unlike ordinary vertical search engines, supports querying for similar products.
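The abstract names forward maximum matching and the vector space model but gives no implementation details. The following Python sketch illustrates both ideas under stated assumptions: raw term-frequency weights, a fixed similarity threshold, and single-pass grouping against each cluster's first member. The function names, the `max_len` and `threshold` parameters, and the choice of Python are all hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def forward_max_match(text, dictionary, max_len=4):
    """Segment text by greedily taking the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        # A single character is always accepted so the scan cannot stall.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-frequency vectors (assumed weighting)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def cluster_by_similarity(docs, threshold=0.5):
    """Single-pass grouping: a document joins the first cluster whose
    first member is at least `threshold` similar, else starts a new cluster."""
    clusters = []
    for doc in docs:
        for cluster in clusters:
            if cosine_similarity(doc, cluster[0]) >= threshold:
                cluster.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters
```

With a toy dictionary containing 垂直 and 搜索引擎, `forward_max_match("垂直搜索引擎", dictionary)` splits the query into those two words, and `cluster_by_similarity` then groups token lists whose cosine similarity meets the threshold, which is one simple way a "similar products" query could be served.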
Source
Journal of Beijing Information Science and Technology University (Natural Science Edition)
2013, No. 1, pp. 38-41 (4 pages)
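Funding
National Natural Science Foundation of China (60873013, 61070119)
Open Project Fund of the Key Laboratory of Computational Linguistics (Peking University), Ministry of Education (KLCL-1005)
Funding Project for Academic Human Resources Development in Institutions of Higher Learning under the Jurisdiction of Beijing Municipality (PHR201007131)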
Keywords
search engine
crawler
clustering
regular expression