A Study of a Two-Level Web Page Deduplication Method

Research on Deletion of Duplicated Web Pages on Two Levels


Abstract: This paper builds a two-level web page deduplication model from the Bloom Filter data structure, the shingling algorithm, and MD5 hashing. At the first level, a Bloom Filter removes duplicate URLs while the web spider is downloading pages, and the false rate of the Bloom Filter is discussed. At the second level, the shingling algorithm deduplicates pages that have already been downloaded, and the criterion for judging two pages to be similar is described. Experimental results are reported, and the shortcomings of the model and directions for further study are pointed out.
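The two levels described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the class and function names (`BloomFilter`, `should_download`, `shingles`, `resemblance`), the filter size, the number of hash functions, the shingle width `w`, and the use of salted MD5 digests as the Bloom Filter hash functions are all assumptions made for the example. Tokenizing on whitespace is also an assumption and would not suit unsegmented Chinese text. The false-rate function uses the standard approximation (1 - e^(-kn/m))^k for a filter of m bits, k hash functions, and n inserted items.

```python
import hashlib
import math


class BloomFilter:
    """Bit-array Bloom filter; the k positions come from salted MD5 digests (assumed scheme)."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # One MD5 digest per hash function, salted with the function index.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def false_positive_rate(m_bits, k_hashes, n_items):
    """Standard approximation of the Bloom filter false rate."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


# Level 1: skip URLs the spider has (probably) already seen.
seen_urls = BloomFilter(m_bits=1 << 20, k_hashes=7)


def should_download(url):
    if url in seen_urls:
        return False
    seen_urls.add(url)
    return True


# Level 2: compare downloaded pages by their sets of w-word shingles.
def shingles(text, w=4):
    """MD5 fingerprints of every run of w consecutive words (one shingle if shorter)."""
    words = text.split()
    count = max(1, len(words) - w + 1)
    return {hashlib.md5(" ".join(words[i:i + w]).encode("utf-8")).hexdigest()
            for i in range(count)}


def resemblance(text_a, text_b, w=4):
    """Jaccard resemblance of the two shingle sets, in [0, 1]."""
    sa, sb = shingles(text_a, w), shingles(text_b, w)
    return len(sa & sb) / len(sa | sb)
```

In a crawler built along these lines, `should_download` would be called before each fetch, and two stored pages would be treated as near-duplicates when `resemblance` exceeds a chosen threshold. The abstract does not state the threshold the paper uses, so none is fixed here.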

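Author: Mao Xiaojiao (毛晓蛟)

Affiliation: Intensive Training College (强化培养学院), Nanjing Normal University

Source: Computer Programming Skills & Maintenance (《电脑编程技巧与维护》), 2010, No. 20, pp. 66-67, 84 (3 pages)

Keywords: Bloom Filter; false rate; Shingling; MD5; similar web pages

Classification: TP393.092 [Automation and Computer Technology: Computer Application Technology]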


References (9)

1 A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 2005, 1(4): 485-509.
2 M. Mitzenmacher. Compressed Bloom Filters. IEEE/ACM Transactions on Networking, 2002, 10(5): 604-612.
3 www.cs.jhu.edu/-fabian/courses/CS600.624/shdes/bloomslides.pdf.
4 http://166.111.248.20/seminar/2006_11_23/hash_2_yaxuan.ppt.
5 http://blog.csdn.net/jiaomeng/archive/2007/01/27/1495500.aspx.
6 Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma. Detecting Near-Duplicates for Web Crawling [C]. International World Wide Web Conference, Banff, Alberta, Canada. New York, USA: ACM, 2007: 141-150.
7 Moses S. Charikar. Similarity Estimation Techniques from Rounding Algorithms [C]. Annual ACM Symposium on Theory of Computing, Montreal, Quebec, Canada. New York, USA: ACM, 2002: 380-388.
8 China Internet Network Information Center. The 16th Statistical Report on Internet Development in China [EB/OL]. http://www.cnnic.net.cn/index/OE/00/11/index.htm. 2005-07-01.
9 Andrei Z. Broder, Steven C. Glassman. Syntactic Clustering of the Web [DB/OL]. http://gatekeeper.research.compaq.com/pub/DEC/SRC/technical-notes/SRC .

