Abstract
Measuring the similarity between sets is central to many data mining tasks. For example, a common way to remove duplicate results in Web search is to compare pages by their Jaccard index. In social network analysis, another common measure is the Adamic-Adar index, which is widely used in link prediction. As the volume of data grows, however, computing the exact similarity between all pairs of items can become intractable. The mainstream estimators for this task are MinHash and SimHash, which are typically applied to large collections of duplicate data, for example in document deduplication systems. Given the importance of the task, the need for more efficient estimators is clear. This paper proposes using DotHash, an unbiased estimator of the intersection size of two sets. DotHash can be used to estimate both the Jaccard index and the Adamic-Adar index. Experimental results show that DotHash is more accurate than other models in link prediction and duplicate document detection.
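The abstract does not spell out how the intersection estimator is built. As a rough illustration only, the sketch below assumes one possible construction: every element is mapped to a pseudo-random vector of +1/-1 entries, a set is summarized by the sum of its elements' vectors, and the scaled dot product of two summaries estimates the intersection size, from which a Jaccard estimate follows. The function names, the `dim` and `seed` parameters, and the sign-vector mapping are assumptions made for this example and are not taken from the paper.

```python
import random


def sketch(items, dim=4096, seed=0):
    """Summarize a set as the sum of one pseudo-random +/-1 vector per element.

    Assumption for this example: the vector for an element is derived
    deterministically from (seed, element), so equal elements in different
    sets always receive the same vector.
    """
    total = [0] * dim
    for item in set(items):
        rng = random.Random(f"{seed}:{item!r}")
        bits = rng.getrandbits(dim)
        for i in range(dim):
            total[i] += 1 if (bits >> i) & 1 else -1
    return total


def estimate_intersection(sk_a, sk_b):
    """Scaled dot product of two sketches.

    A shared element contributes exactly 1; two different elements
    contribute 0 in expectation, so the estimate is unbiased.
    """
    return sum(x * y for x, y in zip(sk_a, sk_b)) / len(sk_a)


def estimate_jaccard(sk_a, sk_b, size_a, size_b):
    """Jaccard estimate from the intersection estimate and the set sizes."""
    inter = estimate_intersection(sk_a, sk_b)
    union = size_a + size_b - inter
    return inter / union if union > 0 else 0.0


def exact_jaccard(a, b):
    """Exact Jaccard index, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


if __name__ == "__main__":
    a = set(range(0, 100))    # 100 elements
    b = set(range(50, 200))   # 150 elements, 50 shared with a
    sk_a, sk_b = sketch(a), sketch(b)
    print("exact    :", exact_jaccard(a, b))
    print("estimated:", estimate_jaccard(sk_a, sk_b, len(a), len(b)))
```

In this construction the sketch length `dim` trades memory for accuracy: the noise in the intersection estimate shrinks as `dim` grows, while each set still needs only one fixed-length vector regardless of its size.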
Author
魏鹏
WEI Peng (Guangdong Baiyun College, Guangzhou, Guangdong 510450)
Source
《长江信息通信》
2023, No. 11, pp. 146-148 (3 pages)
Changjiang Information & Communications