Research on the Focused Search Engine Based on Suffix Tree Clustering (cited by: 4)
Abstract [Purpose/significance] A well-designed focused search engine can better meet the information needs of users in a specialized field. [Method/process] During crawling, the system filters pages by topic by matching anchor text against regular expressions. It integrates the IKAnalyzer Chinese word segmenter, improves the ranking of retrieval results by combining the TF-IDF, OPIC, and Topic-PageRank algorithms, and clusters the results in real time with the STC (suffix tree clustering) algorithm. [Result/conclusion] In experiments on the topic "Library and Information Science", each additional distributed computing node raised the crawl rate by 20%, precision was 23% higher than without ranking optimization, the retrieval results could be clustered in real time and displayed visually, and most result items were relevant papers. [Limitations] The system does not adequately parse the many data formats found in web pages, and the unparsed portions may contain topic-relevant content.
Source: Information Studies: Theory & Application (情报理论与实践), CSSCI, Peking University Core Journal, 2017, No. 12, pp. 123-127, 62 (6 pages)
Keywords: topic filtering (topic distillation); suffix tree clustering; search engine
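
The abstract says topic filtering is done during crawling by matching anchor text against regular expressions, but it does not give the expressions or the crawler code. The sketch below is therefore only an illustration of the general idea, not the authors' implementation: `TOPIC_PATTERN`, `AnchorCollector`, and `on_topic_links` are hypothetical names, and the pattern is an assumed example for the "Library and Information Science" topic.

```python
import re
from html.parser import HTMLParser

# Hypothetical topic pattern; the paper's actual expressions are not given in the abstract.
TOPIC_PATTERN = re.compile(r"图书|情报|library|information science", re.IGNORECASE)


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href yield None and are ignored later.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def on_topic_links(html, pattern=TOPIC_PATTERN):
    """Return the hrefs whose anchor text matches the topic pattern."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return [href for href, text in parser.links if pattern.search(text)]
```

A crawler built along these lines would enqueue only the links returned by `on_topic_links`, so off-topic pages are dropped before they are fetched. Filtering on anchor text alone is cheap, but it can miss relevant pages whose links carry uninformative text.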