Detection and elimination of similar Web pages based on text structure and extraction of long sentences (cited by: 13)
Abstract: Based on the characteristics of duplicated Web pages and the structural features of page body text, this paper proposes a dynamic, hierarchical, and robust algorithm for detecting and eliminating duplicate Web pages. The method represents the body text of a page as a text structure tree and, on that basis, implements a dynamic feature-extraction algorithm and a layer-fingerprint algorithm for computing similarity. Feature extraction relies on a long-sentence extraction algorithm, which ensures strong robustness. Experiments show that the method accurately detects both mirror pages and near-mirror pages.
Source: Application Research of Computers (《计算机应用研究》), indexed in CSCD and the Peking University Core Journals list, 2010, No. 7, pp. 2489-2491, 2497 (4 pages in total)
Funding: Supported by the Natural Science Foundation of Chongqing (CSTC2007BB3169)
Keywords: detection and elimination of similar Web pages; text structure tree; extraction of long sentences; layer fingerprint
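
The abstract names long-sentence extraction as the source of the method's robustness but gives no implementation details. The following is a minimal Python sketch of that general idea only, not the authors' algorithm: it keeps the longest sentences of each paragraph, hashes them, and compares two pages by the Jaccard overlap of their fingerprint sets. The function names, the parameter `k`, the sentence-splitting rule, the use of MD5, and the Jaccard measure are all assumptions made for illustration; the text structure tree and the layer-fingerprint comparison described in the abstract are not reproduced here.

```python
import hashlib
import re

# Assumption: a sentence ends at common Chinese or Western end punctuation.
# Splitting on "." also breaks decimals and URLs, which a real system would handle.
SENTENCE_END = re.compile(r"[。！？；!?;.]+")


def long_sentences(paragraph: str, k: int = 2) -> list[str]:
    """Return the k longest sentences of a paragraph (hypothetical helper)."""
    sentences = [s.strip() for s in SENTENCE_END.split(paragraph) if s.strip()]
    # sorted() is stable, so sentences of equal length keep their original order.
    return sorted(sentences, key=len, reverse=True)[:k]


def fingerprint(sentence: str) -> str:
    """Hash a sentence after removing whitespace, so spacing changes are ignored."""
    normalized = re.sub(r"\s+", "", sentence)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def page_fingerprints(paragraphs: list[str], k: int = 2) -> set[str]:
    """Collect the fingerprints of the long sentences of every paragraph."""
    return {fingerprint(s) for p in paragraphs for s in long_sentences(p, k)}


def similarity(page_a: list[str], page_b: list[str], k: int = 2) -> float:
    """Jaccard overlap of two pages' fingerprint sets (assumed measure)."""
    fa, fb = page_fingerprints(page_a, k), page_fingerprints(page_b, k)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Because only long sentences are fingerprinted, short boilerplate such as navigation labels tends not to influence the result, which is one plausible reading of the robustness claim. A page would be flagged as a duplicate when `similarity` exceeds a chosen threshold; the abstract does not state a threshold.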