Abstract
Similarity measurement is of great significance in big-data applications. However, the traditional traversal-based cosine similarity calculation has poor accuracy and timeliness, which limits its usefulness and prevents it from providing an effective basis for the quality assessment of massive high-dimensional data. To address these problems, two cotangent similarity formulas were constructed from the cotangent trigonometric function and the differences between data dimensions, improving the accuracy of similarity calculation. In addition, a backpropagation (BP) neural network model that approximates the similarity mapping of a dataset was established to reduce the time complexity of similarity calculation. Experimental results show that the improved fast similarity calculation method achieves good accuracy and timeliness, and that the performance improvement is more pronounced on large-scale datasets.
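For orientation, the baseline that the abstract criticizes (cosine similarity evaluated by traversing all record pairs) can be sketched as follows. This is only an illustrative sketch of the standard cosine formula; the function names `cosine_similarity` and `pairwise_traversal` and the sample `records` are assumptions made here, and the paper's two cotangent formulas and its BP network are not reproduced because this record does not state them.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention assumed here for zero vectors
    return dot / (norm_u * norm_v)


def pairwise_traversal(records):
    """Visit every unordered pair: n * (n - 1) / 2 similarity evaluations."""
    return {
        (i, j): cosine_similarity(records[i], records[j])
        for i, j in combinations(range(len(records)), 2)
    }


# Hypothetical toy data: two parallel vectors and one orthogonal vector.
records = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
sims = pairwise_traversal(records)
```

Two properties of this baseline are visible in the sketch: the number of evaluations grows quadratically with the number of records, and the score depends only on the angle between vectors, so `[1.0, 0.0]` and `[2.0, 0.0]` are rated as identical despite differing in magnitude.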
Authors
乔非
关柳恩
王巧玲
QIAO Fei; GUAN Liuen; WANG Qiaoling (College of Electronics and Information Engineering, Tongji University, Shanghai 201804, China)
Source
《同济大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core
2021, No. 1, pp. 153-162 (10 pages)
Journal of Tongji University (Natural Science)
Funding
National Natural Science Foundation of China (71690230/71690234, 61973237, 61873191).
Keywords
similarity calculation
neural network
big data analysis
data quality assessment