Parameters Correlation and Optimization in Text Similarity Measurement

Cited by: 11
Abstract: In text similarity measurement, parameters such as the similarity threshold, precision, recall, shingle sliding-window size, shingle weighting coefficient, and text attributes influence one another in complicated ways. This paper studies the correlations among these parameters and, in light of practical application requirements, offers recommendations for configuring each of them. It also analyzes and designs a similarity measurement algorithm whose similarity threshold adapts to the text-length attribute. The algorithm is applied to a comparison of 7378 project proposals submitted to a fund in 2009. The results show that the algorithm is suitable for large-scale text collections and is also highly useful for collections of short texts, with precision and recall both reaching above 95%.
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2011, No. 5, pp. 983-988 (6 pages)
Funding: Supported by the National Natural Science Foundation of China (60873081, 60970095, M0921005) and by Hunan Province natural science funding (07JJ6122)
Keywords: text similarity measurement; algorithm; shingle; parameter correlation analysis; recall rate
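
The parameters named in the abstract (shingle window size, similarity threshold, precision, recall) can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes whitespace tokenization, an unweighted Jaccard resemblance over shingle sets, and a single fixed threshold, and it omits both the shingle weighting coefficient and the length-adaptive threshold. The function names `shingles`, `resemblance`, and `precision_recall` are hypothetical. Chinese text would need a word segmenter or character-level shingles in place of `str.split`.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def shingles(text: str, w: int) -> Set[Tuple[str, ...]]:
    """Return the set of w-token shingles produced by a sliding window.

    Assumption: tokens are whitespace-separated words.
    """
    tokens = text.split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # Text shorter than the window: treat the whole text as one shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a: str, b: str, w: int = 3) -> float:
    """Unweighted Jaccard resemblance of the two shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def precision_recall(
    scores: Dict[Pair, float], truth: Set[Pair], threshold: float
) -> Tuple[float, float]:
    """Precision and recall of 'similar' decisions at a given threshold.

    scores maps a document pair to its similarity; truth holds the pairs
    that are actually similar (hypothetical labels for illustration).
    """
    predicted = {pair for pair, s in scores.items() if s >= threshold}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall
```

Raising the threshold generally trades recall for precision, and a larger window `w` makes short texts yield very few shingles. This is one reason a threshold tied to text length, as the abstract proposes, may be useful.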