Abstract
To improve the accuracy of detecting near-duplicate records in large-scale document de-duplication, a de-duplication technique based on the Simhash algorithm is studied in depth. Building on the original algorithm, the computation of the Simhash signature value is improved: the ICTCLAS word segmentation tool is introduced, TF-IDF is adopted as the main method for computing feature weights, and the part of speech and word length of each feature are taken into account as two additional weighting factors. The Hamming distance between the resulting signatures is then compared to accurately determine whether two documents are near-duplicates. Experimental results show that the improved algorithm achieves higher precision and recall, and its overall detection performance is superior to both the Shingle algorithm and the original Simhash algorithm. By improving the precision of the signature values, near-duplicate data in large-scale document collections can be detected accurately, achieving the desired de-duplication effect.
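The pipeline the abstract describes — weight each segmented feature, fold the weighted feature hashes into one fingerprint, then compare fingerprints by Hamming distance — can be sketched as below. This is a minimal illustration, not the paper's implementation: MD5 as the per-feature hash is an assumption, and the weights passed in stand for the paper's combined TF-IDF, part-of-speech, and word-length scores.

```python
import hashlib

def simhash(weighted_features, bits=64):
    """Compute a Simhash fingerprint from (feature, weight) pairs."""
    v = [0.0] * bits
    for feature, weight in weighted_features:
        # Hash each feature to a `bits`-bit integer (MD5 here is an
        # arbitrary choice; any well-mixed hash works).
        h = int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16) % (1 << bits)
        for i in range(bits):
            # Add the weight where bit i is set, subtract it where it is not.
            v[i] += weight if (h >> i) & 1 else -weight
    # Collapse the weighted vector: positive components become 1-bits.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two documents are then judged near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; identical weighted feature sets always yield distance 0, and small wording changes perturb only a few bits.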
Source
Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition) (《南京邮电大学学报(自然科学版)》), a PKU Core (北大核心) journal
2016, No. 3, pp. 85-91 (7 pages)
Funding
Supported by the National Natural Science Foundation of China (11501302)