Abstract
To address the problem of duplicated web pages in meta-search results, this paper applies the duplicate detection method based on the LCS (Longest Common Subsequence) to result merging in a meta-search engine and proposes a duplicate detection method based on SLCS (the leading S stands for Summary). After the summary of each result page is obtained, the weight of every sentence in the summary is computed from the sentence length and the number of occurrences of the user's query terms in that sentence; the sentence with the largest weight is extracted as the feature sentence of the page. The similarity between result pages is then computed by comparing the LCS of their feature sentences, which improves the retrieval quality of the meta-search engine. Experiments show that the method achieves high precision.
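The abstract outlines the SLCS pipeline: split each result summary into sentences, weight each sentence by query-term occurrences and sentence length, keep the highest-weight sentence as the feature sentence, and score page similarity by the LCS of the two feature sentences. The Python sketch below illustrates that pipeline under stated assumptions; the exact weighting formula and the normalization of the LCS length are not spelled out in this abstract, so the choices here (term frequency divided by sentence length, LCS length divided by the shorter sentence length) and all function names are illustrative rather than the authors' implementation.

    import re

    def split_sentences(summary):
        # Split a result snippet into sentences on common Chinese/English delimiters.
        return [s for s in re.split(r"[。!?.!?;;]", summary) if s.strip()]

    def sentence_weight(sentence, query_terms):
        # Assumed weighting: query-term occurrences normalized by sentence length.
        hits = sum(sentence.count(t) for t in query_terms)
        return hits / len(sentence) if sentence else 0.0

    def feature_sentence(summary, query_terms):
        # The highest-weight sentence serves as the summary's feature sentence.
        sentences = split_sentences(summary)
        return max(sentences, key=lambda s: sentence_weight(s, query_terms), default="")

    def lcs_length(a, b):
        # Classic dynamic-programming LCS length over characters.
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, ca in enumerate(a, 1):
            for j, cb in enumerate(b, 1):
                dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(a)][len(b)]

    def slcs_similarity(summary_a, summary_b, query_terms):
        # Page similarity: LCS of the two feature sentences, normalized (assumed)
        # by the length of the shorter feature sentence.
        fa = feature_sentence(summary_a, query_terms)
        fb = feature_sentence(summary_b, query_terms)
        if not fa or not fb:
            return 0.0
        return lcs_length(fa, fb) / min(len(fa), len(fb))

In use, two result pages would be treated as duplicates when slcs_similarity exceeds a chosen threshold; the threshold value is a tuning parameter not given in this abstract.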
Source
Library and Information Service (《图书情报工作》)
CSSCI
Peking University Core Journal (北大核心)
2010, No. 15, pp. 113-116 (4 pages)
Keywords
duplicate web page detection
meta-search engine
LCS
feature code