Approximately Duplicate Record Detection Model for Security Data Based on CNN
Abstract Structured data in the security industry contains a large number of approximately duplicate records, and the recognition rate of traditional approximately duplicate record detection algorithms struggles to meet the industry's practical needs. To address this, a convolutional neural network (CNN) is introduced and two improved models based on LeNet-5 are designed: one takes a word embedding matrix as input, and the other takes a similarity matrix as input. Experiments show that the model with the word embedding matrix input achieves precision and recall above 96%, while the model with the similarity matrix input reaches precision and recall of up to 98%. K-fold cross-validation results indicate that both models have strong generalization ability.
Authors Wang Wei; Liu Yang; Hong Huijun; Liang Yajing (School of Information & Electrical Engineering, Hebei University of Engineering, Handan 056038, Hebei, China; Hebei Key Laboratory of Security & Protection Information Sensing and Processing, Handan 056038, Hebei, China; School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, Jiangsu, China)
Source Computer Applications and Software (《计算机应用与软件》), Peking University Core Journal, 2023, No. 2, pp. 17-25 (9 pages)
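Funding National Natural Science Foundation of China (61802107); Ministry of Education-China Mobile Research Fund (MCM20170204); Jiangsu Postdoctoral Research Funding Program (1601085C).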
Keywords Security industry; Data cleaning; Approximately duplicate record detection; CNN; LeNet-5
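
The abstract mentions a model whose input is a similarity matrix built from a pair of records, but it does not say how that matrix is computed. The sketch below is only one plausible illustration: it compares every field of one record against every field of another, using the ratio from Python's `difflib.SequenceMatcher` as a stand-in similarity measure. The function names, the sample records, and the choice of metric are all assumptions and are not taken from the paper.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values.

    Assumption: difflib's ratio is used only as a placeholder metric;
    the paper's actual similarity measure is not given in the abstract.
    """
    return SequenceMatcher(None, a, b).ratio()


def similarity_matrix(rec_a: list[str], rec_b: list[str]) -> list[list[float]]:
    """Pairwise similarity of every field in rec_a against every field in rec_b."""
    return [[field_similarity(x, y) for y in rec_b] for x in rec_a]


# Hypothetical structured records (name, city, phone); not data from the paper.
rec_a = ["Zhang San", "Handan", "13800000000"]
rec_b = ["Zhang  San", "Handan City", "13800000000"]

matrix = similarity_matrix(rec_a, rec_b)
for row in matrix:
    print([round(value, 2) for value in row])
```

A matrix like this could be fed to a small CNN as a single-channel image, where a strong diagonal suggests that the two records describe the same entity. Whether the paper arranges or normalizes its matrix this way is not stated in the abstract.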