
Application and Research of Web Page Deduplication in a Web-Based Enterprise Competitive Intelligence Platform
Abstract: The presence of large numbers of duplicated web pages on the Internet seriously degrades the quality of information retrieval. This paper therefore proposes a web page deduplication algorithm based on signature hashing: signatures are used to hash each page's feature-sentence set, narrowing the scope of comparison and improving deduplication accuracy. Experiments show that the algorithm achieves high accuracy and good performance, and it has been used to implement web page deduplication in a web-based enterprise competitive intelligence platform.
Source: Journal of Yunnan Minzu University (Natural Sciences Edition), CAS, 2008, No. 4, pp. 380-382 (3 pages).
Funding: Kunming Municipal Technology Innovation Fund for Science and Technology-Based Small and Medium-Sized Enterprises (CL2007061).
Keywords: web page deduplication; signature; feature sentence set
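
To make the method described in the abstract concrete, the following sketch shows one way signature hashing of a feature-sentence set can narrow the comparison scope when detecting duplicated web pages. It is a minimal illustration in Python, not the authors' implementation: the choice of the longest sentences as the feature-sentence set, the MD5 signature, the bucketing step, and the overlap threshold are all assumptions made for the example.

import hashlib
import re
from collections import defaultdict

def feature_sentences(text: str, k: int = 5) -> list[str]:
    # Assumption: take the k longest sentences of a page as its feature-sentence set.
    sentences = [s.strip() for s in re.split(r"[。！？.!?]", text) if s.strip()]
    return sorted(sentences, key=len, reverse=True)[:k]

def signature(sentence: str) -> str:
    # Hash one feature sentence into a fixed-length signature (MD5 chosen for illustration).
    return hashlib.md5(sentence.encode("utf-8")).hexdigest()

def group_candidates(pages: dict[str, str]) -> dict[str, set[str]]:
    # Bucket page ids by shared sentence signatures; only pages that fall into the
    # same bucket ever need a pairwise comparison, which is how hashing narrows the scope.
    buckets: dict[str, set[str]] = defaultdict(set)
    for page_id, text in pages.items():
        for sent in feature_sentences(text):
            buckets[signature(sent)].add(page_id)
    return buckets

def is_duplicate(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
    # Treat two pages as near-duplicates when their signature sets largely overlap
    # (Jaccard similarity over signature sets, with an illustrative threshold).
    sigs_a = {signature(s) for s in feature_sentences(text_a)}
    sigs_b = {signature(s) for s in feature_sentences(text_b)}
    if not sigs_a or not sigs_b:
        return False
    return len(sigs_a & sigs_b) / len(sigs_a | sigs_b) >= threshold

Pages that share no signature bucket are never compared pairwise, which is where the reduction in comparison scope, and hence the performance gain reported in the experiments, would come from in a scheme of this kind.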

References (7)

  • 1. 中国互联网信息中心 (CNNIC). Survey Report on the Quantity of Internet Information Resources in China, 2005 [EB/OL]. [2008-04-30]. http://www.cnnic.net.cn/.
  • 2. 张刚, 刘挺, 郑实福, et al. A fast deduplication algorithm for large-scale web pages [C] // Proceedings of the 20th Anniversary Conference of the Chinese Information Processing Society of China (Supplement), 2001.
  • 3. 陈基漓, 牛秦洲. Web page deduplication based on signatures [J]. 微计算机信息, 2006, 22(03X): 113-115.
  • 4. 吴平博, 陈群秀, 马亮. Research on a fast deduplication algorithm for large-scale Chinese web pages based on feature strings [J]. 中文信息学报, 2003, 17(2): 28-35.
  • 5. CHO J H, SHIVAKUMAR N, GARCIA-MOLINA H. Finding replicated Web collections [C] // Proceedings of the ACM International Conference on Management of Data. USA: ACM Press, 2000, 29(2): 355-366.
  • 6. 王建勇, 谢正茂, 雷鸣, 李晓明. Research and evaluation of near-replica web page detection algorithms [J]. 电子学报, 2000, 28(z1): 130-132.
  • 7. CHO J H, GARCIA-MOLINA H. Finding near-replicas of documents on the Web [C] // Proceedings of the Workshop on Web Databases (WebDB). Spain: Springer Press, 1998: 204-212.
