
Research and Application on Text Duplication Removal Based on DRPKP Algorithm (cited by: 3)
Abstract: SimHash is currently a mainstream algorithm for text duplication detection and removal, but it gives no special consideration to the natural topical similarity of text within a specific industry. Drawing on years of experience in information management and data integration in the financial and securities industry, this paper analyzes the problems of existing text deduplication methods, in particular the shortcomings of SimHash on industry-specific text, and proposes a new deduplication method based on paragraph topics (paragraph key phrases), abbreviated as the DRPKP algorithm. The algorithm was applied to detect and remove duplicate text in a real data set on Guo Tai Jun An's Financial Information and Unified Information Retrieval Platform. In comparative tests on three metrics (deduplication precision, recall, and deduplication time), DRPKP improved precision by 24.5% and recall by 16.34% relative to SimHash, and also required less deduplication time.
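Authors: Yu Feng (俞枫), Wang Yinna (王引娜)
Source: Microcomputer Applications (《微型电脑应用》), 2014, No. 1, pp. 58-60 (3 pages)
Funding: Supported by the National Key Technology R&D Program project "Demonstration of Integrated Services for Securities and Financial Product Trading" (No. 2012BAH13F03)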
Keywords: text deduplication; paragraph topic; SimHash; similarity; MapReduce
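
The abstract uses SimHash as its baseline but does not describe it. For orientation, the following is a minimal sketch of the general SimHash idea (Charikar-style fingerprints compared by Hamming distance). It is not the authors' DRPKP method, and it is not the exact SimHash configuration used in the paper. The function names, the whitespace/word tokenizer, term-frequency weights, the use of MD5 as the token hash, the 64-bit width, and the distance threshold of 3 are all assumptions made for illustration.

```python
import hashlib
import re


def tokenize(text):
    """Split text into lowercase word tokens.

    Assumption for illustration only: Chinese text would need a proper
    word segmenter rather than this regular expression.
    """
    return re.findall(r"\w+", text.lower())


def simhash(text, bits=64):
    """Return a `bits`-wide SimHash fingerprint of `text` as an integer."""
    # Weight each distinct token by its term frequency (an assumed weighting).
    weights = {}
    for token in tokenize(text):
        weights[token] = weights.get(token, 0) + 1

    # For every bit position, add the weight if the token's hash bit is 1
    # and subtract it if the bit is 0.
    acc = [0] * bits
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            if (h >> i) & 1:
                acc[i] += weight
            else:
                acc[i] -= weight

    # The sign of each accumulator decides the corresponding fingerprint bit.
    fingerprint = 0
    for i in range(bits):
        if acc[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, max_distance=3):
    """Treat two texts as near-duplicates if their fingerprints are close.

    The threshold of 3 is an assumed value, not one taken from the paper.
    """
    return hamming_distance(simhash(text_a), simhash(text_b)) <= max_distance
```

Because the fingerprint is built from the whole bag of tokens, two documents that share most of their vocabulary tend to receive close fingerprints even when they are not copies of each other. That is consistent with the abstract's observation that SimHash makes no allowance for the natural topical similarity of text within one industry.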