
Short Text Duplicate Removal Algorithm Based on Feature Iteration (cited by: 4)
Abstract: Short texts have low, nearly uniform term frequencies and a simple structure, so duplicate removal algorithms built on conventional feature selection methods are ill-suited to them. This paper proposes a duplicate removal algorithm tailored to short text. It generates a fingerprint for each short text with the SimHash algorithm, clusters the fingerprints with the Shared Nearest Neighbor (SNN) algorithm, and adds or deletes initial features according to the clustering result, iterating until convergence to detect duplicate short texts. Experimental results on two real datasets show that the algorithm removes duplicate short texts more effectively than existing text duplicate removal algorithms.
Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2015, Issue 12, pp. 54-57, 63 (5 pages in total)
Funding: National Key Technology R&D Program of China (2012BAH13F02); Science and Technology Commission of Shanghai Municipality (12511502403, 12511509602)
Keywords: SimHash algorithm; Shared Nearest Neighbor (SNN); iteration; feature selection; short text; duplicate removal
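The abstract names two building blocks, SimHash fingerprints and Shared Nearest Neighbor similarity, without giving their parameters. The Python sketch below illustrates both under stated assumptions: features are a token-to-weight mapping, each token is hashed with MD5 truncated to 64 bits, and neighbors are ranked by Hamming distance. The function names, the hash choice, the fingerprint width, and `k` are all hypothetical, and the paper's feature add/delete iteration is not reproduced because this record does not specify it.

```python
import hashlib


def simhash(features, bits=64):
    """Return a SimHash fingerprint for a {token: weight} mapping.

    Assumption: tokens are hashed with MD5 truncated to `bits` (<= 128) bits.
    """
    totals = [0.0] * bits
    for token, weight in features.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & ((1 << bits) - 1)
        for i in range(bits):
            # A set bit votes +weight for position i, a clear bit votes -weight.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def k_nearest(fingerprints, k):
    """For each fingerprint, the index set of its k nearest other fingerprints."""
    neighbors = []
    for i, fp in enumerate(fingerprints):
        others = [j for j in range(len(fingerprints)) if j != i]
        # Ties are broken by index so the result is deterministic.
        others.sort(key=lambda j: (hamming(fp, fingerprints[j]), j))
        neighbors.append(set(others[:k]))
    return neighbors


def snn_similarity(neighbors, i, j):
    """SNN similarity: how many of their k nearest neighbors i and j share."""
    return len(neighbors[i] & neighbors[j])
```

With these helpers, two short texts can be treated as near-duplicate candidates when their fingerprints are close in Hamming distance and they share many nearest neighbors. The thresholds for both are tuning choices that this record does not state.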
