SVM-Based Duplicate Web Page Detection Algorithm

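Abstract: The Internet contains a large number of duplicate web pages, which degrade the user experience and complicate search. To address these problems, the comparison of similar web pages is recast as a binary classification problem, and a supervised learning algorithm is used to construct a discriminant function, which avoids the error introduced by manually setting a similarity threshold. The discriminant function trained with an SVM is then applied to pairs of web pages to decide whether the pages are duplicates.
Author: 冯金波
Source: 《软件导刊》 (Software Guide), 2015, No. 3, pp. 57-58 (2 pages)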
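
The idea in the abstract, treating each pair of pages as one sample for a binary classifier instead of comparing a similarity score against a hand-picked threshold, can be sketched as follows. This is only an illustrative sketch, not the paper's implementation: the abstract does not state which features, kernel, or training procedure were used, so the two pair features (word-shingle Jaccard similarity and length ratio), the linear SVM trained by subgradient descent on the hinge loss, the hyperparameters, and the toy pages are all assumptions made here.

```python
from typing import List, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of k-word shingles of a page's text (assumed feature basis)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def pair_features(a: str, b: str) -> List[float]:
    """Hypothetical feature vector for a page pair: [shingle Jaccard, length ratio]."""
    sa, sb = shingles(a), shingles(b)
    union = sa | sb
    jaccard = len(sa & sb) / len(union) if union else 1.0
    la, lb = len(a.split()), len(b.split())
    ratio = min(la, lb) / max(la, lb) if max(la, lb) else 1.0
    return [jaccard, ratio]


def train_linear_svm(X: List[List[float]], y: List[int], lam: float = 0.01,
                     lr: float = 0.1, epochs: int = 2000) -> Tuple[List[float], float]:
    """Full-batch subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only samples inside the margin contribute
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def is_duplicate(a: str, b: str, w: List[float], bias: float) -> bool:
    """Apply the learned discriminant function to one page pair."""
    score = sum(wj * xj for wj, xj in zip(w, pair_features(a, b))) + bias
    return score > 0


# Toy labeled page pairs (made up for illustration): +1 duplicate, -1 distinct.
p1 = "breaking news the city council approved the new budget today"
p1_near = p1 + " share this"  # same article plus widget text
p2 = "recipe for a simple tomato soup with fresh basil"
p3 = "how to install python on a linux machine"
labeled_pairs = [(p1, p1, 1), (p1, p1_near, 1), (p1, p2, -1), (p2, p3, -1), (p1, p3, -1)]

X = [pair_features(a, b) for a, b, _ in labeled_pairs]
y = [label for _, _, label in labeled_pairs]
w, bias = train_linear_svm(X, y)

print(is_duplicate(p2, p2 + " read more", w, bias))
```

The decision boundary is learned from the labeled pairs, so no similarity cutoff appears anywhere in the code. On real data one would likely use richer pair features and an established SVM library rather than this hand-rolled trainer.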