
Research on Meta-Search Deduplication Based on SLCS (cited by: 1)

The Study on the Duplicated Detection Algorithm Based on SLCS with Meta Search Engine
Abstract: To address duplicate web pages in meta-search results, this paper applies a web page deduplication method based on the longest common subsequence (LCS) to meta-search engines and proposes a deduplication method called SLCS, where the leading S stands for Summary. After obtaining the summary of each result page, the method computes a weight for every sentence in the summary from the number of times the query terms occur in the sentence and from the sentence length. The sentence with the largest weight is taken as the feature sentence of the page summary. The similarity of two result pages is then computed by comparing the LCS of their feature sentences, with the aim of improving the retrieval quality of the meta-search engine. Experiments show that the method achieves high precision.
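The three steps named in the abstract (sentence weighting, feature sentence selection, LCS comparison) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not give the weight formula or the similarity normalization, so the sketch assumes weight = query-term occurrences divided by sentence length, and similarity = LCS length divided by the longer feature sentence's length. All function names are hypothetical.

```python
import re
from typing import List


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def split_sentences(summary: str) -> List[str]:
    """Split a summary on common Western and Chinese sentence terminators."""
    parts = re.split(r"[.!?。！？]+", summary)
    return [p.strip() for p in parts if p.strip()]


def sentence_weight(sentence: str, query_terms: List[str]) -> float:
    # ASSUMPTION: the abstract only says the weight depends on query-term
    # count and sentence length; this ratio is one plausible choice.
    hits = sum(sentence.lower().count(t.lower()) for t in query_terms)
    return hits / len(sentence)


def feature_sentence(summary: str, query_terms: List[str]) -> str:
    """Return the highest-weighted sentence of the summary ('' if none)."""
    sentences = split_sentences(summary)
    if not sentences:
        return ""
    return max(sentences, key=lambda s: sentence_weight(s, query_terms))


def slcs_similarity(summary_a: str, summary_b: str, query_terms: List[str]) -> float:
    """Similarity in [0, 1] between two summaries via their feature sentences."""
    fa = feature_sentence(summary_a, query_terms)
    fb = feature_sentence(summary_b, query_terms)
    if not fa or not fb:
        return 0.0
    # ASSUMPTION: normalize by the longer feature sentence.
    return lcs_length(fa, fb) / max(len(fa), len(fb))
```

In use, two results returned by different member search engines would be treated as duplicates when `slcs_similarity` exceeds some threshold. The threshold value is not stated in the abstract and would have to be tuned experimentally.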
Source: Library and Information Service (《图书情报工作》), CSSCI, PKU Core, 2010, No. 15, pp. 113-116 (4 pages)
Keywords: web page duplicate detection; meta-search engine; LCS; feature code

