Abstract
SimHash is currently the mainstream algorithm for text deduplication, but it gives no special consideration to the natural topical similarity of texts within a specific industry. Drawing on years of experience in information management and data integration in the finance and securities industry, this paper analyzes the problems of existing text deduplication methods, in particular the shortcomings of SimHash when applied to industry-specific text, and proposes a new deduplication method based on paragraph topics (paragraph key phrases), abbreviated as DRPKP. The algorithm was applied to detect and remove duplicate text in a real data set from Guo Tai Jun An's Financial Information and Unified Information Retrieval Platform. In comparative tests on three metrics (deduplication precision, recall, and deduplication time), DRPKP improved precision by 24.5% and recall by 16.34% relative to SimHash, and it also took less time.
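For reference, the SimHash baseline mentioned in the abstract can be sketched as follows. This is a generic, minimal illustration of SimHash fingerprinting, not the paper's implementation and not the DRPKP method. The function names, the term-frequency weighting, the MD5-based token hash, the 64-bit fingerprint size, and the Hamming-distance threshold of 3 are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter


def simhash(tokens, bits=64):
    """Return a `bits`-bit SimHash fingerprint for a sequence of tokens."""
    # Assumption: each distinct token is weighted by its term frequency.
    weights = Counter(tokens)
    acc = [0] * bits
    for token, weight in weights.items():
        # Assumption: MD5 truncated to `bits` bits serves as the token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            acc[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(bits):
        if acc[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(tokens_a, tokens_b, threshold=3):
    """Treat two texts as duplicates if their fingerprints differ in few bits."""
    # Assumption: a threshold of 3 bits is a hypothetical default.
    return hamming_distance(simhash(tokens_a), simhash(tokens_b)) <= threshold
```

Because the fingerprint depends only on token weights, two documents from the same industry that share much of their vocabulary can land close together even when they are not true duplicates, which is the kind of weakness the abstract attributes to SimHash on industry-specific text.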
Source
《微型电脑应用》
2014, Issue 1, pp. 58-60 (3 pages)
Microcomputer Applications
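Funding
Supported by the National Key Technology R&D Program project "Securities and Financial Product Trading Integrated Service Demonstration" (Grant No. 2012BAH13F03)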