The improvement for sorting algorithm of search engine based on webpage segmentation
Abstract: Search engines currently rank results using the whole webpage as the smallest processing unit, which makes the ranking (sorting) vulnerable to noise in the page. To address this problem, this paper proposes purifying webpages through webpage segmentation and using the purified result to improve the traditional ranking algorithm. First, the vision-based page segmentation algorithm VIPS divides a webpage into several semantic blocks. Next, a set of rules retains the semantic blocks that are highly relevant to the page's topic. Finally, these retained blocks represent the whole webpage during retrieval, which reduces the effect of webpage noise on the correctness of the ranking algorithm and improves retrieval quality. Experiments demonstrate the advantage of the improved algorithm.
Source: Journal of Zhejiang University of Technology, 2009, No. 5, pp. 495-498 (4 pages). Indexed in CAS and the Peking University Core Journals list.
Keywords: webpage noise; webpage segmentation; webpage purification; sorting algorithm; VIPS
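The three steps named in the abstract (segment the page, keep the topic-relevant blocks, retrieve over the kept blocks only) can be illustrated with a minimal sketch. It does not implement VIPS and does not reproduce the paper's rules or ranking formula, none of which are given in this record. The blocks are hand-made stand-ins for VIPS output, the `link_words` feature and the thresholds in `is_content_block` are hypothetical, and a plain term-frequency score stands in for the search engine's ranking function.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Block:
    """One semantic block. Hand-made here; the paper obtains blocks from VIPS."""
    text: str
    link_words: int  # words inside hyperlinks (hypothetical feature)


def is_content_block(block: Block, min_words: int = 5,
                     max_link_ratio: float = 0.5) -> bool:
    """Hypothetical rule: keep blocks that are long enough and not mostly links."""
    words = block.text.split()
    if len(words) < min_words:
        return False
    return block.link_words / len(words) <= max_link_ratio


def purify(blocks: list[Block]) -> str:
    """Join the text of the blocks that pass the filtering rule."""
    return " ".join(b.text for b in blocks if is_content_block(b))


def term_frequency_score(text: str, query: str) -> float:
    """Stand-in ranking score: query-term occurrences divided by text length."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(words.count(term) for term in query.lower().split())
    return hits / len(words)


page = [
    Block("home news sports camera deals login", link_words=6),   # navigation
    Block("the new camera has a fast lens and a large sensor", link_words=0),
    Block("cheap camera camera camera buy camera now", link_words=7),  # advert
]

full_text = " ".join(b.text for b in page)
clean_text = purify(page)

# Score the same query against the whole page and against the purified page.
print(term_frequency_score(full_text, "camera"))
print(term_frequency_score(clean_text, "camera"))
```

In this toy page the navigation and advert blocks repeat the query term, so scoring the whole page overstates its relevance; scoring only the retained block removes that inflation, which is the effect the abstract attributes to webpage purification.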
References (6)

  • 1. CHANG Lu, XIA Zuqi. Several common ranking algorithms of search engines [J]. Library and Information Service, 2003, 47(6): 70-73. (in Chinese)
  • 2. CHEN Guang. Lucene study, part 1: origin, current status, and basic use [EB/OL]. [2004-08-23]. http://blog.csdn.net/ncflywolf/archive/2005/06/29/407586.aspx. (in Chinese)
  • 3. HU Tao, LU Hongying. Research on a search engine based on Nutch [J]. Computer Era, 2007(1): 57-59. (in Chinese)
  • 4. LIN Shianhun, HO Janming. Discovering informative content blocks from web documents [C]// Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Edmonton, Alberta, Canada: ACM Press, 2002: 588-593.
  • 5. CHEN Jinlin, ZHOU Baoyao, SHI Jin, et al. Function based object model towards website adaptation [C]// Proceedings of the 10th World Wide Web Conference. Hong Kong, China: ACM Press, 2001: 587-596.
  • 6. CAI Deng, YU Shipeng, WEN Jirong. VIPS: a vision-based page segmentation algorithm [EB/OL]. [2003-11-01]. http://research.microsoft.com/~jrwen/jrwen_files/publications/VIPS_Technical%20Report.PDF.
