Abstract
To address the problem of duplicated web pages in meta-search results, this paper applies the duplicate detection method based on the LCS (Longest Common Subsequence) to result merging in a meta-search engine and proposes a duplicate detection method based on SLCS (the leading S stands for Summary). After the summary of each result page is obtained, the weight of every sentence in the summary is computed from the sentence length and the number of occurrences of the user's query terms in that sentence; the sentence with the largest weight is extracted as the feature sentence of the page. The similarity between result pages is then computed by comparing the LCS of their feature sentences, which improves the retrieval quality of the meta-search engine. Experiments show that the method achieves high precision.
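The abstract outlines the SLCS pipeline: split each result summary into sentences, weight each sentence by query-term occurrences and sentence length, keep the highest-weight sentence as the feature sentence, and score page similarity by the LCS of the two feature sentences. The Python sketch below illustrates that pipeline under stated assumptions; the exact weighting formula and the normalization of the LCS length are not spelled out in this abstract, so the choices here (term frequency divided by sentence length, LCS length divided by the shorter sentence length) and all function names are illustrative rather than the authors' implementation.

    import re

    def split_sentences(summary):
        # Split a result snippet into sentences on common Chinese/English delimiters.
        return [s for s in re.split(r"[。!?.!?;;]", summary) if s.strip()]

    def sentence_weight(sentence, query_terms):
        # Assumed weighting: query-term occurrences normalized by sentence length.
        hits = sum(sentence.count(t) for t in query_terms)
        return hits / len(sentence) if sentence else 0.0

    def feature_sentence(summary, query_terms):
        # The highest-weight sentence serves as the summary's feature sentence.
        sentences = split_sentences(summary)
        return max(sentences, key=lambda s: sentence_weight(s, query_terms), default="")

    def lcs_length(a, b):
        # Classic dynamic-programming LCS length over characters.
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, ca in enumerate(a, 1):
            for j, cb in enumerate(b, 1):
                dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(a)][len(b)]

    def slcs_similarity(summary_a, summary_b, query_terms):
        # Page similarity: LCS of the two feature sentences, normalized (assumed)
        # by the length of the shorter feature sentence.
        fa = feature_sentence(summary_a, query_terms)
        fb = feature_sentence(summary_b, query_terms)
        if not fa or not fb:
            return 0.0
        return lcs_length(fa, fb) / min(len(fa), len(fb))

In use, two result pages would be treated as duplicates when slcs_similarity exceeds a chosen threshold; the threshold value is a tuning parameter not given in this abstract.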
Source
Library and Information Service (《图书情报工作》)
CSSCI
Peking University Core Journal (北大核心)
2010, No. 15, pp. 113-116 (4 pages)
Keywords
duplicate web page detection
meta-search engine
LCS
feature code