
Research on a Method for Removing Duplicate News Webpages Based on Publication Time (cited by: 3)
Abstract: Web search results often contain redundant pages with identical content. These pages waste storage and complicate information retrieval and other text processing. After extracting each news page's title, main content, and publication date, this paper exploits the time-sensitive (perishable) nature of news by grouping pages by publication date, and presents an exploratory study of duplicate-page removal on that basis. The approach substantially reduces computation time and improves deduplication accuracy.
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2007, No. 6, pp. 119-121 (3 pages)
Funding: National Natural Science Foundation of China (Grant No. 60475022); Natural Science Foundation of Shanxi Province of China (Grant No. 20041041); Shanxi Province Foundation for Returned Overseas Scholars (No. 2002004).
Keywords: news webpages; main content extraction; duplicate webpage removal; weight calculation
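
The grouping idea in the abstract can be illustrated with a minimal sketch. It assumes the title, main content, and publication date have already been extracted, and it compares pages only within the same date group, which is where the reduction in comparisons comes from. The abstract does not describe the paper's similarity or weight calculation, so character-shingle Jaccard similarity is swapped in here purely as a stand-in; `NewsPage`, `shingles`, `jaccard`, `deduplicate`, and the 0.8 threshold are all hypothetical names and values, not the authors' method.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class NewsPage:
    """Hypothetical record holding the fields assumed to be already extracted."""
    url: str
    title: str
    body: str
    published: date


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-character shingles of the whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets (a stand-in for the paper's measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(pages: list[NewsPage], threshold: float = 0.8) -> list[NewsPage]:
    """Keep one page per near-duplicate cluster, comparing within each date only."""
    # Step 1: group pages by publication date.
    groups: dict[date, list[NewsPage]] = defaultdict(list)
    for page in pages:
        groups[page.published].append(page)

    kept: list[NewsPage] = []
    # Step 2: inside each group, drop any page too similar to one already kept.
    for day in sorted(groups):
        seen: list[tuple[NewsPage, set[str]]] = []
        for page in groups[day]:
            sig = shingles(page.title + " " + page.body)
            if all(jaccard(sig, other) < threshold for _, other in seen):
                seen.append((page, sig))
        kept.extend(p for p, _ in seen)
    return kept
```

With n pages spread over g dates of roughly equal size, the pairwise comparisons drop from about n²/2 to about n²/(2g). The trade-off in this sketch is that identical stories carrying different publication dates are never compared with each other.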


