
A Similar Text Detection Algorithm Based on Newshingling (cited by: 1)
Abstract: Objective: To construct a new text duplicate-detection algorithm that modifies the traditional Shingling web page deduplication algorithm, in order to improve the accuracy of text similarity computation as well as precision and recall. Methods: The traditional Shingling algorithm is modified: meaningless function words are first deleted from the text, the text is then split into shingles according to its semantics, and a text similarity formula is applied to compute the similarity between texts. Results: The algorithm improves the accuracy of text similarity computation; precision rises by about 10% and recall by about 5%. Conclusion: Experiments show that the proposed algorithm is simple to implement, feasible, and computes text similarity well, and that it offers certain advantages.
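The pipeline summarized in the abstract (drop function words, split into shingles, compare) can be illustrated with a minimal sketch. This record does not give the paper's semantic slicing rule or its similarity formula, so the sketch below substitutes plain assumptions: whitespace tokenization, fixed-width token shingles, and the classic Jaccard resemblance over shingle sets. The `FUNCTION_WORDS` list and all function names are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical function-word list; the paper's actual list is not given here.
FUNCTION_WORDS: FrozenSet[str] = frozenset(
    {"the", "a", "an", "of", "and", "to", "in", "is"}
)


def content_tokens(text: str) -> List[str]:
    """Lowercase, split on whitespace, and drop function words."""
    return [t for t in text.lower().split() if t not in FUNCTION_WORDS]


def shingles(tokens: List[str], w: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous w-token shingles.

    Assumption: fixed-width shingles stand in for the paper's
    semantics-based slicing, which this record does not describe.
    """
    if len(tokens) < w:
        # A text shorter than w tokens becomes a single short shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a: str, b: str, w: int = 2) -> float:
    """Jaccard resemblance of the two texts' shingle sets, in [0, 1]."""
    sa = shingles(content_tokens(a), w)
    sb = shingles(content_tokens(b), w)
    if not sa and not sb:
        return 1.0  # two texts with no content tokens are treated as identical
    return len(sa & sb) / len(sa | sb)
```

With these assumptions, `resemblance("the cat sat on the mat", "a cat sat on a mat")` treats the two sentences as identical, because they differ only in function words that are removed before shingling. Punctuation handling and Chinese word segmentation are left out of the sketch and would need a real tokenizer.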
Source: Journal of Shenyang Jianzhu University (Natural Science), indexed in CAS and the Peking University Core Journals list, 2011, No. 4, pp. 771-775 (5 pages)
Funding: Liaoning Provincial Department of Education fund project (L2010449)
Keywords: vector space model (VSM); text similarity; Shingling algorithm; word segmentation

