期刊文献+

基于关键长句及正文长度预分类的网页去重算法研究 被引量:2

Research on Based on Web Algorithm Based on Key Long Sentence and Text Length Preliminary Classification
下载PDF
导出
摘要 伴随互联网所包含网页数目的剧增,转载现象变得相当普遍。作为提高搜索引擎服务质量的关键问题之一,网页去重技术已经成为网页信息处理最为重要的环节。在对传统网页去重技术进行研究的基础上,针对网页正文的结构特征,提出了一种基于关键长句及正文长度预分类的网页去重算法的核心思想。实验证明,该算法具有较高的召回率及准确率,在重复网页的过滤中有着较好的应用前景与较高的研究价值。 As to improve the quality of search engine service one of the key problems, web page to heavy technology has become web information processing is the most important link. Based on the traditional web page to heavy technology, based on the features of the structure of web text, this paper puts forward a kind of based on key long sentences and tex^t length preliminary classification of web page to heavy algorithm core ideas. Experiments show that the algorithm has high- er recall ratio and accuracy, the duplicated web pages in the filter has a good application prospect and high research value.
作者 周杨
出处 《软件导刊》 2012年第10期48-50,共3页 Software Guide
关键词 网页去重 关键长句 预分类 Web Page Key Long Sentence Preliminary Classification
  • 相关文献

参考文献4

二级参考文献18

  • 1王建勇,谢正茂,雷鸣,李晓明.近似镜像网页检测算法的研究与评价[J].电子学报,2000,28(z1):130-132. 被引量:21
  • 2陈基漓,牛秦洲.基于特征码的网页去重[J].微计算机信息,2006,22(03X):113-115. 被引量:11
  • 3[1]T.W. Yan and H. Garcia- Molina. Duplicate removal in information dissemination. In Proceedings of the 21st International Conference on Very Large Data Bases(VLDB' 95) ,66 - 77,San Francisco,Ca., USA,September 1995. Morgan Kaufmann Publishers, Inc.
  • 4[2]Narayanan Shivakumar and Hector Garcia- Molina. SCAM: a copy detection mechanism for digital documents. In Proceedings of 2nd International Conference in Theory and Practice of Digital Libraries (DL'95) ,Austin, Texas,June 1995.
  • 5[3]T. Yan and H. Garcia- Molina. The sift information dissemination system. In ACM TODS,2000.
  • 6[4]J.W. Kirriemuir & P. Willett Identification of duplicate and near - duplicate full - text records in database search outputs using hierarchic cluster analysis,in Program-automated library and information,(1995)29(3) :241-256.
  • 7[5]Buckley C. ,Cardie C. ,Mardis S. ,Mitra M. ,Pierce D. ,Wagstaff K. ,Walz J. ,The Smart/Empire TIPSTER IR System, TIPSTER Phase Ⅲ Proceedings,Morgan Kaufmann,San Francisco,CA,2000.
  • 8[4]J.Zhou,P.Larson,J.C.Freytag,W.Lehner.Efficient Exploitation of Similar Subexpressions for Query Processing.ACM SIGMOD,2007:533-544.
  • 9[6]Junghoo Cho.N.Shivakumar et al.Finding replicated web collections.In Proceedings of 2000 ACM International Conference on Management of Data(SIGMOD),May 2000.
  • 10[7]Shaozhi Ye,Ji-RongWen,Wei-Ying Ma.A systematic study on parameter correlations in large-scale duplicate document detection.Knowledge and Information Systems,2007,14:217-232.

共引文献50

同被引文献10

引证文献2

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部