Research on a Two-Level Web Page Deduplication Method

Original English title: Research on Deletion of Duplicated Web Pages on Two Levels
Abstract: This paper builds a two-level web page deduplication model using the Bloom Filter data structure, the shingling algorithm, and MD5 encoding. At the first level, a Bloom Filter removes duplicate URLs while the web spider is downloading pages; the error (false-positive) probability of the Bloom Filter is also discussed. At the second level, the shingling algorithm removes duplicates among the pages already downloaded, and the method for judging whether two pages are similar is described. Experimental results are reported, and the paper points out the shortcomings of the model and directions for further research on the method.
Author: Mao Xiaojiao (毛晓蛟)
Source: Computer Programming Skills & Maintenance (《电脑编程技巧与维护》), 2010, No. 20, pp. 66-67, 84 (3 pages)
Keywords: Bloom Filter; false rate; shingling; MD5; similar web pages
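
The abstract describes two levels: a Bloom Filter that rejects already-seen URLs during crawling, and shingling with MD5 to compare downloaded pages. The following is a minimal sketch of that idea, not the paper's implementation. The class and function names, the double-hashing scheme derived from one MD5 digest, the whitespace tokenization, and the shingle width w = 4 are all assumptions made for illustration. The false-positive estimate uses the standard approximation (1 - e^(-kn/m))^k for m bits, k hash functions, and n inserted items.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter over strings (hypothetical helper class)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Assumption: derive k bit positions from one MD5 digest by
        # double hashing, instead of using k independent hash functions.
        digest = hashlib.md5(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force a non-zero stride
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> bool:
        """Insert item; return True if it was (possibly) seen before."""
        seen = True
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen


def false_positive_rate(num_bits: int, num_hashes: int, num_items: int) -> float:
    """Standard approximation of the Bloom filter false-positive probability."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def shingle_hashes(text: str, w: int = 4) -> set:
    """MD5 digests of every run of w consecutive whitespace-separated tokens."""
    tokens = text.split()
    if not tokens:
        return set()
    count = max(len(tokens) - w + 1, 1)
    return {
        hashlib.md5(" ".join(tokens[i:i + w]).encode("utf-8")).hexdigest()
        for i in range(count)
    }


def resemblance(a: str, b: str, w: int = 4) -> float:
    """Jaccard similarity of the two shingle sets, in [0, 1]."""
    sa, sb = shingle_hashes(a, w), shingle_hashes(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

In a crawler, `add(url)` returning True would mean "skip this URL", accepting that a small fraction of new URLs are skipped by mistake. Two pages would be treated as similar when `resemblance` exceeds a chosen threshold; the abstract does not state the threshold, so any value used here would be an assumption. Whitespace tokenization does not suit Chinese text, which would need a word segmenter or character-level shingles.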

References (9)

  • 1. A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 2005, 1(4): 485-509.
  • 2. M. Mitzenmacher. Compressed Bloom Filters. IEEE/ACM Transactions on Networking, 2002, 10(5): 604-612.
  • 3. www.cs.jhu.edu/-fabian/courses/CS600.624/shdes/bloomslides.pdf.
  • 4. http://166.111.248.20/seminar/2006_11_23/hash_2_yaxuan.ppt.
  • 5. http://blog.csdn.net/jiaomeng/archive/2007/01/27/1495500.aspx.
  • 6. Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma. Detecting Near-Duplicates for Web Crawling [C]. International World Wide Web Conference, Banff, Alberta, Canada. New York, USA: ACM, 2007: 141-150.
  • 7. Moses S. Charikar. Similarity Estimation Techniques from Rounding Algorithms [C]. Annual ACM Symposium on Theory of Computing, Montreal, Quebec, Canada. New York, USA: ACM, 2002: 380-388.
  • 8. China Internet Network Information Center (中国互联网络信息中心). The 16th Statistical Report on Internet Development in China [EB/OL]. http://www.cnnic.net.cn/index/OE/00/11/index.htm. 2005-07-01.
  • 9. Andrei Z. Broder, Steven C. Glassman. Syntactic Clustering of the Web [DB/OL]. http://gatekeeper.research.compaq.com/pub/ DEC/SRC/technical-notes/SRC .
