期刊文献+

基于概念和语义网络的近似网页检测算法 被引量:15

Near Duplicated Web Pages Detection Based on Concept and Semantic Network
下载PDF
导出
摘要 在搜索引擎的检索结果页面中,用户经常会得到内容近似的网页.为了提高检索整体性能和用户满意度,提出了一种基于概念和语义网络的近似网页检测算法DWDCS(near-duplicate webpages detection based on concept and semantic network).改进了经典基于小世界理论提取文档关键词的算法.首先对文档概念进行抽取和归并,不但解决了"表达差异"问题,而且有效降低了语义网络的复杂度;从网络结构的几何特征对其进行分析,同时利用网页的语法和结构信息构建特征向量进行文档相似度的计算,由于无须使用语料库,使得算法天生具有领域无关的优点.实验结果表明,与经典的网页去重算法(I-Match)和单纯依赖词汇共现小世界模型的算法相比,DWDCS具有很好的抵抗噪声的能力,在大规模实验中获得了准确率>90%和召回率>85%的良好测试结果.良好的时空间复杂度及算法性能不依赖于语料库的优点,使其在大规模网页去重实际应用中获得了良好的效果. Reprinting websites and blogs produces a great deal redundant WebPages. To improve search efficiency and user satisfaction, the near-Duplicate WebPages Detection based on Concept and Semantic network (DWDCS) is proposed. In the course of developing a near-duplicate detection system for a multi-billion pages repository, this paper makes two research contributions. First, the key concept is extracted, instead of the keyphrase, to build Small Word Network (SWN). This not only reduces the complexity of the semantic network, but also resolves the "expression difference" problem. Second, this paper considers both syntactic and semantic information to present and compute the documents' similarities. In a large-scale test, experimental results demonstrate that this approach outperforms that of both I-Match and keyphrase extraction algorithms based on SWN. Many advantages such as linear time and space complexity, without using a corpus, make the algorithm valuable in actual practice.
出处 《软件学报》 EI CSCD 北大核心 2011年第8期1816-1826,共11页 Journal of Software
基金 国家自然科学基金(60803050 60705022) 新世纪优秀人才计划(NCET-06-0161)
关键词 网页去重算法 小世界网络 近似网页 均方差 duplicate removal algorithm small world network near duplicated Web page standard deviation
  • 相关文献

参考文献2

二级参考文献7

  • 1[1]T.W. Yan and H. Garcia- Molina. Duplicate removal in information dissemination. In Proceedings of the 21st International Conference on Very Large Data Bases(VLDB' 95) ,66 - 77,San Francisco,Ca., USA,September 1995. Morgan Kaufmann Publishers, Inc.
  • 2[2]Narayanan Shivakumar and Hector Garcia- Molina. SCAM: a copy detection mechanism for digital documents. In Proceedings of 2nd International Conference in Theory and Practice of Digital Libraries (DL'95) ,Austin, Texas,June 1995.
  • 3[3]T. Yan and H. Garcia- Molina. The sift information dissemination system. In ACM TODS,2000.
  • 4[4]J.W. Kirriemuir & P. Willett Identification of duplicate and near - duplicate full - text records in database search outputs using hierarchic cluster analysis,in Program-automated library and information,(1995)29(3) :241-256.
  • 5[5]Buckley C. ,Cardie C. ,Mardis S. ,Mitra M. ,Pierce D. ,Wagstaff K. ,Walz J. ,The Smart/Empire TIPSTER IR System, TIPSTER Phase Ⅲ Proceedings,Morgan Kaufmann,San Francisco,CA,2000.
  • 6卢汉清,孔维新,廖明,马颂德.基于内容的视频信号与图像库检索中的图像技术[J].自动化学报,2001,27(1):56-59. 被引量:30
  • 7宋擒豹,沈钧毅.数字商品非法复制和扩散的监测机制[J].计算机研究与发展,2001,38(1):121-125. 被引量:38

共引文献101

同被引文献173

引证文献15

二级引证文献122

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部