
Research on a Concept-Based Algorithm for Processing Web Page Similarity (cited by 8)

Concept based algorithm of dealing near-replicas of documents on the Web
Abstract: To handle near-replica documents among the massive number of Web pages gathered by search engines, a Web page similarity processing algorithm suited to search engines is proposed. Using concepts abstracted from the Web pages, the algorithm builds a similarity processing model on top of an inverted file. The model narrows the set of Web documents for which similarity must be computed, which saves considerable time and space and lays a good foundation for optimizing similarity computation and near-replica detection.
Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2006, No. 12, pp. 3030-3032 (3 pages)
Funding: Supported by the Northwestern Polytechnical University Graduate Entrepreneurship Seed Fund (Z200644)
Keywords: near-replica documents; concept extraction; cluster analysis; near-replica detection
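
The abstract's central idea, using an inverted file over page concepts to shrink the set of pages that need a similarity computation, can be illustrated with a minimal sketch. This is not the paper's model: the concept sets are given by hand rather than extracted, and the function names, the `min_shared` filter, the Jaccard measure, and the `threshold` value are all assumptions introduced for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_index(page_concepts):
    """Map each concept to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, concepts in page_concepts.items():
        for concept in concepts:
            index[concept].add(page_id)
    return index


def candidate_pairs(index, min_shared=2):
    """Return page pairs that share at least `min_shared` concepts.

    Only pages that appear together in a posting list are ever paired,
    so pages with no concept in common are never compared.
    """
    shared = defaultdict(int)
    for pages in index.values():
        for a, b in combinations(sorted(pages), 2):
            shared[(a, b)] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}


def jaccard(a, b):
    """Jaccard similarity of two concept sets (an assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_replicas(page_concepts, min_shared=2, threshold=0.5):
    """Score only the candidate pairs and keep those above `threshold`."""
    index = build_inverted_index(page_concepts)
    result = []
    for a, b in sorted(candidate_pairs(index, min_shared)):
        score = jaccard(page_concepts[a], page_concepts[b])
        if score >= threshold:
            result.append((a, b, score))
    return result


# Hypothetical concept sets; a real system would extract these from page text.
pages = {
    "p1": {"search", "engine", "index", "crawler"},
    "p2": {"search", "engine", "index", "ranking"},
    "p3": {"football", "league", "score"},
    "p4": {"football", "engine"},
}

print(near_replicas(pages))
```

With four pages there are six possible pairs, but only the pair (p1, p2) shares two or more concepts, so it is the only one for which similarity is computed. That reduction in comparisons is the saving the abstract describes.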

