
Application and Research of Web Page Deduplication in a Web-Based Enterprise Competitive Intelligence Platform
Abstract: The presence of large numbers of duplicated web pages on the Internet seriously degrades the quality of information retrieval. This paper therefore proposes a web page deduplication algorithm based on signature hashing: signatures are used to hash each page's feature-sentence set, narrowing the scope of comparison and improving deduplication accuracy. Experiments show that the algorithm achieves high accuracy and good performance, and it has been used to implement web page deduplication in a web-based enterprise competitive intelligence platform.
Source: Journal of Yunnan Minzu University (Natural Sciences Edition), CAS, 2008, No. 4, pp. 380-382 (3 pages).
Funding: Kunming Municipal Technology Innovation Fund for Science and Technology-Based Small and Medium-Sized Enterprises (CL2007061).
Keywords: web page deduplication; signature; feature sentence set
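
To make the method described in the abstract concrete, the following sketch shows one way signature hashing of a feature-sentence set can narrow the comparison scope when detecting duplicated web pages. It is a minimal illustration in Python, not the authors' implementation: the choice of the longest sentences as the feature-sentence set, the MD5 signature, the bucketing step, and the overlap threshold are all assumptions made for the example.

import hashlib
import re
from collections import defaultdict

def feature_sentences(text: str, k: int = 5) -> list[str]:
    # Assumption: take the k longest sentences of a page as its feature-sentence set.
    sentences = [s.strip() for s in re.split(r"[。！？.!?]", text) if s.strip()]
    return sorted(sentences, key=len, reverse=True)[:k]

def signature(sentence: str) -> str:
    # Hash one feature sentence into a fixed-length signature (MD5 chosen for illustration).
    return hashlib.md5(sentence.encode("utf-8")).hexdigest()

def group_candidates(pages: dict[str, str]) -> dict[str, set[str]]:
    # Bucket page ids by shared sentence signatures; only pages that fall into the
    # same bucket ever need a pairwise comparison, which is how hashing narrows the scope.
    buckets: dict[str, set[str]] = defaultdict(set)
    for page_id, text in pages.items():
        for sent in feature_sentences(text):
            buckets[signature(sent)].add(page_id)
    return buckets

def is_duplicate(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
    # Treat two pages as near-duplicates when their signature sets largely overlap
    # (Jaccard similarity over signature sets, with an illustrative threshold).
    sigs_a = {signature(s) for s in feature_sentences(text_a)}
    sigs_b = {signature(s) for s in feature_sentences(text_b)}
    if not sigs_a or not sigs_b:
        return False
    return len(sigs_a & sigs_b) / len(sigs_a | sigs_b) >= threshold

Pages that share no signature bucket are never compared pairwise, which is where the reduction in comparison scope, and hence the performance gain reported in the experiments, would come from in a scheme of this kind.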

References (7)

  • 1. 中国互联网信息中心 (CNNIC). Survey Report on the Quantity of Internet Information Resources in China, 2005 [EB/OL]. [2008-04-30]. http://www.cnnic.net.cn/.
  • 2. 张刚, 刘挺, 郑实福, et al. A fast deduplication algorithm for large-scale web pages [C] // Proceedings of the 20th Anniversary Conference of the Chinese Information Processing Society of China (Supplement), 2001.
  • 3. 陈基漓, 牛秦洲. Web page deduplication based on signatures [J]. 微计算机信息, 2006, 22(03X): 113-115.
  • 4. 吴平博, 陈群秀, 马亮. Research on a fast deduplication algorithm for large-scale Chinese web pages based on feature strings [J]. 中文信息学报, 2003, 17(2): 28-35.
  • 5. CHO J H, SHIVAKUMAR N, GARCIA-MOLINA H. Finding replicated Web collections [C] // Proceedings of the ACM International Conference on Management of Data. USA: ACM Press, 2000, 29(2): 355-366.
  • 6. 王建勇, 谢正茂, 雷鸣, 李晓明. Research and evaluation of near-replica web page detection algorithms [J]. 电子学报, 2000, 28(z1): 130-132.
  • 7. CHO J H, GARCIA-MOLINA H. Finding near-replicas of documents on the Web [C] // Proceedings of the Workshop on Web Databases (WebDB). Spain: Springer Press, 1998: 204-212.
