Abstract
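Clustering can partition large-scale data, based on data similarity, into clusters that users can understand quickly, so that users learn sooner what a large collection of documents contains. Clustering has therefore become an indispensable part of search engines and an active research topic. Weak-linked documents on the Web, such as AJAX applications and PowerPoint files, lack sufficient hyperlink information, so searches for such documents return poorly ranked results. To address this problem, this paper presents a search engine framework for weak-linked documents and focuses on a ranking algorithm for weak-linked documents that is based on web page search results. The clustering-based ranking algorithm uses a clustering algorithm to extract query-related topics from high-quality web page search results, determines the importance of each topic from the ranks of its related web pages, and computes the ranking score of each weak-linked document from the identified weighted topics. Experimental results show that the algorithm produces good rankings for weak-linked documents.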
Clustering can partition a large number of documents into a small number of clusters according to document similarity, and the resulting clusters help people grasp the content of the documents quickly. Clustering therefore plays an important role in search engines and attracts considerable interest from both industry and academia. Current search engines rank weak-linked documents, such as PowerPoint files and AJAX applications, poorly; as a result, they return either completely irrelevant results or poorly ranked documents when searching for these files. This paper proposes a novel clustering-based framework for retrieving and ranking weak-linked documents. Experiments show that the approach considerably improves on the result quality of current search engines and of latent semantic indexing.
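The ranking idea summarized in the abstract (cluster the top web results into topics, weight each topic by the ranks of its pages, then score weak-linked documents against the weighted topics) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the greedy single-pass clustering, the bag-of-words cosine similarity, the reciprocal-rank topic weight, the `threshold` value, and all function names.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed text representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_results(ranked_pages, threshold=0.3):
    """Greedy single-pass clustering of web results given in rank order.

    Each page joins the first cluster whose centroid is similar enough,
    otherwise it starts a new cluster. This stands in for whatever
    clustering algorithm the paper actually uses (an assumption).
    """
    clusters = []
    for rank, text in enumerate(ranked_pages, start=1):
        vec = tf_vector(text)
        for cluster in clusters:
            if cosine(vec, cluster["centroid"]) >= threshold:
                cluster["centroid"].update(vec)  # add term counts
                cluster["ranks"].append(rank)
                break
        else:
            clusters.append({"centroid": Counter(vec), "ranks": [rank]})
    return clusters


def topic_weight(ranks):
    """Assumed weighting: sum of reciprocal ranks of the topic's pages."""
    return sum(1.0 / r for r in ranks)


def score_document(doc_text, clusters):
    """Score a weak-linked document against the weighted topics."""
    vec = tf_vector(doc_text)
    return sum(
        topic_weight(c["ranks"]) * cosine(vec, c["centroid"]) for c in clusters
    )


def rank_weak_linked(docs, ranked_pages, threshold=0.3):
    """Return docs sorted by descending topic-based score."""
    clusters = cluster_results(ranked_pages, threshold)
    return sorted(docs, key=lambda d: score_document(d, clusters), reverse=True)


# Hypothetical data: web results in rank order, then weak-linked documents.
pages = [
    "python clustering tutorial kmeans",
    "clustering kmeans python example",
    "cooking pasta recipe",
]
docs = ["pasta recipe slides", "kmeans clustering slides python"]
ranked_docs = rank_weak_linked(docs, pages)
```

In this toy data the two clustering pages form one highly ranked topic, so the slide deck about k-means is placed ahead of the one about pasta even though neither document carries any hyperlink information.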
Source
《现代计算机》 (Modern Computer)
2013, Issue 19, pp. 3-7 (5 pages)
Keywords
Search Engine
Clustering Technology
Weak-Linked Document