Abstract
This paper builds a two-level model for removing duplicate web pages using the Bloom Filter data structure, the shingling algorithm, and MD5 encoding. At the first level, a Bloom Filter discards duplicate URLs while the web spider is downloading pages, and the error probability of the Bloom Filter is discussed. At the second level, the shingling algorithm removes duplicates among the pages already downloaded, and the method for judging whether two pages are similar is described. Experimental results are reported, and the shortcomings of the model and directions for further research are pointed out.
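The two levels described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no parameters, so the filter size `m_bits`, the hash count `k_hashes`, the salted-MD5 scheme for deriving bit positions, the shingle width `w`, and all function and class names are assumptions made for illustration. The error estimate uses the standard Bloom filter approximation, and similarity is measured as the Jaccard resemblance of the two shingle sets, which is the usual formulation of shingling.

```python
import hashlib
import math


class BloomFilter:
    """Bit-array Bloom filter; the k bit positions come from salted MD5 digests
    (an assumed scheme, the paper's actual hash functions are not stated)."""

    def __init__(self, m_bits: int, k_hashes: int) -> None:
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: str):
        # One MD5 digest per hash index, reduced modulo the filter size.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def false_positive_rate(m_bits: int, k_hashes: int, n_items: int) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k after n insertions."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


def shingles(text: str, w: int = 4) -> set:
    """MD5 fingerprints of every run of w consecutive words.
    Texts shorter than w words yield a single shingle of the whole text."""
    words = text.split()
    count = max(len(words) - w + 1, 1)
    return {
        hashlib.md5(" ".join(words[i:i + w]).encode("utf-8")).hexdigest()
        for i in range(count)
    }


def resemblance(a: str, b: str, w: int = 4) -> float:
    """Jaccard resemblance of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb)


def crawl_order(urls, m_bits: int = 1 << 16, k_hashes: int = 4) -> list:
    """Level 1: return the URLs a spider would fetch, skipping any URL
    the filter reports as already seen."""
    seen = BloomFilter(m_bits, k_hashes)
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

In use, a spider would pass candidate URLs through `crawl_order` (or the filter directly) before downloading, then compare downloaded pages pairwise with `resemblance` and drop one page of any pair whose score exceeds a chosen threshold. The threshold itself is not given in the abstract and would have to be tuned experimentally.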
Source
《电脑编程技巧与维护》
2010, No. 20, pp. 66-67, 84 (3 pages)
Computer Programming Skills & Maintenance