A Similar Text Detection Algorithm Based on Newshingling (基于Newshingling的相似文本检测算法)

Cited by: 1


Abstract: Objective: To construct a new duplicate-text detection algorithm that modifies the traditional Shingling web-page deduplication algorithm, so as to improve the accuracy of text similarity computation and raise both precision and recall. Methods: The traditional Shingling algorithm is modified as follows: meaningless function words are first removed from the text, the text is then split into shingles according to its semantics, and a text similarity formula is applied to compute the similarity between texts. Results: The algorithm improves the accuracy of text similarity computation; precision rises by about 10% and recall by about 5%. Conclusion: Experiments show that the proposed algorithm is simple to implement, feasible, and gives good text similarity results, and that it offers certain advantages.
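The pipeline described in the abstract (remove function words, split into shingles, compute a similarity score) can be illustrated with a minimal sketch. The abstract does not give the paper's semantic slicing rule or its similarity formula, so this sketch substitutes standard fixed-width word shingles and the Jaccard resemblance used in classic Shingling. The function-word list, the whitespace tokenizer (a stand-in for Chinese word segmentation), and the names `tokenize`, `shingles`, and `resemblance` are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical function-word list; the paper's actual list is not given in the abstract.
FUNCTION_WORDS = frozenset({"the", "a", "an", "of", "to", "in", "and", "is"})


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and drop function words.

    Whitespace splitting stands in for a real word segmenter.
    """
    return [w for w in text.lower().split() if w not in FUNCTION_WORDS]


def shingles(tokens: List[str], w: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous w-token shingles (fixed width, not semantic)."""
    if len(tokens) < w:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a: str, b: str, w: int = 2) -> float:
    """Jaccard resemblance of the two texts' shingle sets."""
    sa, sb = shingles(tokenize(a), w), shingles(tokenize(b), w)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)
```

With these assumptions, two sentences that differ only in function words share all their shingles, which is the effect the function-word removal step is meant to achieve; punctuation is not stripped here and would need separate handling in practice.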

Authors: Zhao Deping (赵德平), Cai Lijing (蔡丽静), Li Peng (李鹏)

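Affiliations: School of Science, Shenyang Jianzhu University; School of Information and Control Engineering, Shenyang Jianzhu University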

Source: Journal of Shenyang Jianzhu University: Natural Science (沈阳建筑大学学报（自然科学版）), 2011, No. 4, pp. 771-775 (5 pages). Indexed in: CAS, Peking University Core Journals (北大核心).

Funding: Liaoning Provincial Department of Education fund project (L2010449)

Keywords: vector space model (VSM); text similarity; Shingling algorithm; word segmentation

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]


