Abstract
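Existing document similarity algorithms can compute the similarity between two documents, but they cannot determine the position of the duplicated or most similar sub-content. To address this, a duplicate-checking algorithm for sub-content within documents, based on Particle Swarm Optimization (PSO), is proposed. PSO is used to find the position and length of the best-matching similar sub-content in two documents, and a correlation function is designed to judge the degree of similarity between strings, from which the evaluation function of the particle swarm is derived. Tests show that the algorithm quickly and accurately determines the position and length of the duplicated or most similar sub-content.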
Existing algorithms can measure the similarity between documents, but they cannot locate duplicated partial content within them. This paper proposes an algorithm for detecting duplicated partial content in documents. It uses Particle Swarm Optimization (PSO) to search for the most similar partial content in two documents, and it defines an encoding for the particles. A new correlation coefficient for strings is defined to measure string similarity, and the PSO evaluation function is built on this coefficient. A hybrid mutation PSO algorithm is used to find the most similar partial content quickly and accurately. Simulation experiments indicate that the algorithm effectively finds the most similar partial content in two documents.
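The search described in the abstract can be illustrated with a minimal sketch. The abstract does not give the particle encoding, the correlation function, or the hybrid mutation step, so everything here is an assumption: each particle is encoded as a hypothetical triple (start in document A, start in document B, window length), `similarity` is a simple stand-in for the paper's correlation function, and the update rule is plain PSO with no mutation. The names `similarity`, `decode`, `evaluate`, and `pso_search` are illustrative only.

```python
import random


def similarity(sub_a, sub_b):
    """Stand-in for the paper's correlation function (an assumption):
    squared count of position-wise equal characters divided by the window
    length, so a tight exact match beats a longer, looser window."""
    if not sub_a or len(sub_a) != len(sub_b):
        return 0.0
    matches = sum(ca == cb for ca, cb in zip(sub_a, sub_b))
    return matches * matches / len(sub_a)


def decode(particle, doc_a, doc_b):
    """Clamp a real-valued particle to a valid (pos_a, pos_b, length) triple."""
    max_len = min(len(doc_a), len(doc_b))
    length = int(round(min(max(particle[2], 1), max_len)))
    pos_a = int(round(min(max(particle[0], 0), len(doc_a) - length)))
    pos_b = int(round(min(max(particle[1], 0), len(doc_b) - length)))
    return pos_a, pos_b, length


def evaluate(particle, doc_a, doc_b):
    """Evaluation function: similarity of the two windows the particle selects."""
    pos_a, pos_b, length = decode(particle, doc_a, doc_b)
    return similarity(doc_a[pos_a:pos_a + length], doc_b[pos_b:pos_b + length])


def pso_search(doc_a, doc_b, n_particles=40, n_iters=150, seed=0):
    """Plain PSO (no mutation) over (pos_a, pos_b, length); maximizes evaluate."""
    rng = random.Random(seed)
    upper = [len(doc_a), len(doc_b), min(len(doc_a), len(doc_b))]
    swarm = [[rng.uniform(0, u) for u in upper] for _ in range(n_particles)]
    velocity = [[0.0] * 3 for _ in range(n_particles)]
    best = [p[:] for p in swarm]                      # personal best positions
    best_fit = [evaluate(p, doc_a, doc_b) for p in swarm]
    g = max(range(n_particles), key=best_fit.__getitem__)
    g_best, g_fit = best[g][:], best_fit[g]           # global best so far
    w, c1, c2 = 0.7, 1.5, 1.5                         # assumed PSO coefficients
    for _ in range(n_iters):
        for i, p in enumerate(swarm):
            for d in range(3):
                velocity[i][d] = (w * velocity[i][d]
                                  + c1 * rng.random() * (best[i][d] - p[d])
                                  + c2 * rng.random() * (g_best[d] - p[d]))
                p[d] = min(max(p[d] + velocity[i][d], 0.0), upper[d])
            fit = evaluate(p, doc_a, doc_b)
            if fit > best_fit[i]:
                best[i], best_fit[i] = p[:], fit
                if fit > g_fit:
                    g_best, g_fit = p[:], fit
    return decode(g_best, doc_a, doc_b), g_fit
```

Calling `pso_search(doc_a, doc_b)` on two non-empty strings returns the decoded triple and its score. Because the search is stochastic, the result depends on the seed and the swarm settings, and a plain swarm like this one can stall in a local optimum, which is presumably the issue the paper's hybrid mutation is meant to address.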
Source
《计算机工程》
CAS
CSCD
Peking University Core Journals (北大核心)
2011, Issue 20, pp. 203-205 (3 pages)
Computer Engineering
Funding
Supported by the Zhejiang Provincial Department of Education Fund (Y200908502)
Keywords
duplicate checking
similarity function
Particle Swarm Optimization (PSO)
evaluation function
character string