Abstract
To address the shortcomings of Simhash, an algorithm for large-scale document deduplication, we present an improved Simhash algorithm. First, document similarity is computed from multiple dimensions, including document content, keywords, tags, and cited references, and a new formula is defined to calculate it. Second, the way Simhash computes document features is improved: the weight of each word is computed from both TF-IDF and the word's topic relevance. TF-IDF measures how important a keyword is to one document within a document set, and a length statistic function over technical terms serves as the basis for judging a word's topic relevance. Finally, the retrieval step adopts the idea of hashing to buckets. Because the resulting distribution can be uneven, a threshold is set; when a bucket exceeds the threshold, the elements in that bucket are hashed a second time, which reduces the number of candidate pairs and makes the distribution more even. Experimental results show that the improved algorithm clearly improves the efficiency and accuracy of the original Simhash algorithm.
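The weighting and bucketing steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives neither the similarity formula nor the exact topic-relevance function, so `topic_factor` is a hypothetical placeholder (defaulting to 1.0), the smoothed IDF variant, the 64-bit fingerprint width, the use of MD5 as the token hash, and the choice of the top 16 bits as the bucket key with the next 16 bits as the secondary key are all assumptions made for illustration.

```python
import hashlib
import math
from collections import Counter, defaultdict

BITS = 64  # fingerprint width (assumed)


def h64(token):
    """Stable 64-bit hash of a token (MD5 chosen only for convenience)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def word_weights(doc, corpus, topic_factor=lambda word: 1.0):
    """TF-IDF weight per word, scaled by a hypothetical topic-relevance factor."""
    n = len(corpus)
    weights = {}
    for word, count in Counter(doc).items():
        df = sum(1 for d in corpus if word in d)  # documents containing the word
        idf = math.log((1 + n) / (1 + df)) + 1.0  # smoothed IDF (assumed variant)
        weights[word] = (count / len(doc)) * idf * topic_factor(word)
    return weights


def simhash(weights):
    """Weighted Simhash: each word votes +w or -w on every fingerprint bit."""
    acc = [0.0] * BITS
    for word, w in weights.items():
        hv = h64(word)
        for i in range(BITS):
            acc[i] += w if (hv >> i) & 1 else -w
    return sum(1 << i for i in range(BITS) if acc[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def build_buckets(fingerprints, threshold=2):
    """Bucket by the top 16 bits; re-key any bucket larger than `threshold`."""
    primary = defaultdict(list)
    for doc_id, fp in fingerprints.items():
        primary[(fp >> 48,)].append(doc_id)
    buckets = {}
    for key, ids in primary.items():
        if len(ids) <= threshold:
            buckets[key] = ids
            continue
        for doc_id in ids:  # second hash: the next 16 bits of the fingerprint
            sub = (fingerprints[doc_id] >> 32) & 0xFFFF
            buckets.setdefault(key + (sub,), []).append(doc_id)
    return buckets
```

In this sketch, candidate pairs would be drawn only from documents that share a bucket key and then confirmed with `hamming`; splitting overfull buckets shrinks the number of pairs that must be compared, which is the effect the abstract attributes to the second hash.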
Authors
WANG Cheng (王诚)
WANG Yu-cheng (王宇成)
School of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
Source
Computer Technology and Development (《计算机技术与发展》)
2019, No. 2, pp. 115-119 (5 pages)
Funding
Natural Science Foundation of Jiangsu Province for Youth (BK20150861)