
Finding near replicas of Web pages based on Fourier transform
(基于傅立叶变换的网页去重算法 · Cited by: 2)
Abstract: Removing duplicate Web pages improves search accuracy and reduces data storage space. Current text de-duplication algorithms rely mainly on keywords or semantic fingerprints, and both are prone to false positives when applied to Web pages. In this paper, a Karhunen-Loève (K-L) expansion of the character relationship matrix maps each character to a numerical value, so that each document becomes a sequence of discrete values. A discrete Fourier transform of this sequence expresses each Web page as a vector of Fourier coefficients, and the similarity of two pages is judged by comparing their coefficient vectors. Experimental results show that the method removes near-duplicate Web pages effectively.
Source: 《计算机应用》 (Journal of Computer Applications), CSCD / Peking University Core Journal, 2008, Issue 4, pp. 948-950 (3 pages).
Keywords: duplicate removal of Web pages; Karhunen-Loève (K-L) transform; Fourier transform; dimension reduction
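The pipeline described in the abstract is compact enough to sketch. The Python below is a minimal illustration under stated assumptions, not the authors' implementation: the construction of the character relationship matrix (here, a symmetric co-occurrence count), the use of a single principal K-L component, the signature length k, and cosine similarity as the comparison measure are all illustrative choices the paper does not specify here.

```python
import numpy as np

def char_values(docs, window=2):
    """Map each character to a scalar via a K-L (eigen) expansion of a
    character co-occurrence matrix (an assumed form of the paper's
    'character relationship matrix')."""
    chars = sorted({c for d in docs for c in d})
    idx = {c: i for i, c in enumerate(chars)}
    cooc = np.zeros((len(chars), len(chars)))
    for d in docs:
        for i, c in enumerate(d):
            for j in range(max(0, i - window), min(len(d), i + window + 1)):
                if j != i:
                    cooc[idx[c], idx[d[j]]] += 1
    # K-L expansion of a symmetric matrix = eigendecomposition; project
    # each character onto the eigenvector of the largest eigenvalue.
    _, vecs = np.linalg.eigh(cooc)          # eigenvalues ascending
    principal = vecs[:, -1]
    return {c: principal[idx[c]] for c in chars}

def fourier_signature(doc, mapping, k=32):
    """DFT of the document's character-value sequence, truncated to the
    first k coefficient magnitudes: a fixed-length page signature."""
    seq = np.array([mapping[c] for c in doc if c in mapping])
    sig = np.abs(np.fft.fft(seq)[:k])
    return np.pad(sig, (0, max(0, k - len(sig))))

def similarity(sig_a, sig_b):
    """Cosine similarity between two Fourier signatures."""
    denom = np.linalg.norm(sig_a) * np.linalg.norm(sig_b)
    return float(sig_a @ sig_b) / denom if denom else 0.0

docs = ["the quick brown fox", "the quick brown fox!", "lorem ipsum dolor"]
mapping = char_values(docs)
sigs = [fourier_signature(d, mapping) for d in docs]
print(similarity(sigs[0], sigs[1]))  # near-duplicates -> high similarity
print(similarity(sigs[0], sigs[2]))  # unrelated pages -> lower similarity
```

Truncating to the leading coefficient magnitudes is what yields the dimension reduction named in the keywords: every page, whatever its length, is compared through a fixed-length vector.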
• Related literature

References (4)

  • 1 HENZINGER M. Finding near-duplicate Web pages: A large-scale evaluation of algorithms [C]// Annual ACM Conference on Research and Development in Information Retrieval. Washington: ACM Press, 2006: 284-291.
  • 2 MANBER U. Finding similar files in a large file system [C]// USENIX Winter 1994 Technical Conference. Berkeley, CA, USA: USENIX Association, 1994: 2-2.
  • 3 WU Ping-bo, CHEN Qun-xiu, MA Liang. A fast duplicate-detection algorithm for large-scale Chinese Web pages based on feature strings [J]. Journal of Chinese Information Processing, 2003, 17(2): 28-35. (Cited by: 41)
  • 4 LI Wei, LIU Jian-yi, WANG Cong. Web document duplicate removal algorithm based on keyword sequences [C]// Natural Language Processing and Knowledge Engineering. Valencia: IEEE Press, 2005: 511-516.


