Abstract
Massive data in network environments is highly complex: it mixes structured, semi-structured, and unstructured data, and data blocks often show high similarity in length and position. Existing similar-data identification methods are task-intensive and occupy a large amount of memory. To further improve data utilization and reduce data redundancy, a modeling and simulation method for data similarity identification based on the ordered clustering equation is proposed. Wavelet technology and data deduplication technology are first used to denoise the network data, and the feature vectors of the network data are then extracted in an optimized manner by presetting the data set center. On this basis, the similarity between feature vectors is analyzed in both the time and space dimensions, and a data similarity identification model is constructed based on a point cloud classification network and the ordered clustering equation. Experimental results show that the proposed method yields a normalized mutual information value of 0.12 when identifying data similarity, indicating high accuracy, and that it completes similarity identification for all data within 0.6 s for data sets of different sizes. These results demonstrate that the method is accurate and efficient in application.
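The abstract names data deduplication as a preprocessing step but does not say how it is implemented. As an illustration only, the following minimal sketch assumes fixed-size blocks and SHA-256 fingerprints. The function names `deduplicate_blocks` and `rebuild`, the block size, and the sample bytes are hypothetical and are not taken from the paper.

```python
import hashlib


def deduplicate_blocks(data: bytes, block_size: int = 4):
    """Split data into fixed-size blocks and keep one copy of each distinct block.

    Assumption: fixed-size chunking with SHA-256 fingerprints; the paper does
    not state which chunking or hashing scheme it uses.
    """
    store = {}   # fingerprint -> block content (unique blocks only)
    recipe = []  # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # store a block only the first time it is seen
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original byte string from the unique blocks and the recipe."""
    return b"".join(store[digest] for digest in recipe)


# Hypothetical sample: the block b"ABCD" occurs three times, b"EFGH" once.
sample = b"ABCDABCDEFGHABCD"
store, recipe = deduplicate_blocks(sample)
```

Here `store` holds each distinct block once and `recipe` records the order needed to restore the input, so redundant blocks cost only a fingerprint. Fixed-size chunking is the simplest choice; it misses duplicates that are shifted by an insertion, which content-defined chunking would handle.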
Authors
ZHANG Yuan (张媛); ZHANG Hui-jun (张慧钧)
School of Modern Manufacturing Engineering, Heilongjiang University of Technology, Jixi, Heilongjiang 158100, China; College of Modern Manufacturing Engineering, Yan'an University, Yan'an, Shaanxi 716000, China
Source
Computer Simulation (《计算机仿真》)
Peking University Core Journal (北大核心)
2023, No. 4, pp. 402-406 (5 pages)
Funding
Supported by the Natural Science Foundation of Heilongjiang Province (LH2022A023).
Keywords
Wavelet technology
Data deduplication technology
Feature vector similarity
Point cloud classification network
Ordered clustering equation