Abstract
Data mining has advanced considerably in both research and application in the big data era, but the disclosure of large amounts of sensitive information exposes users to many threats and losses. How to protect data privacy during cluster analysis has therefore become a hot topic in data mining and data privacy protection. The traditional differentially private k-means algorithm is sensitive to the choice of its initial centers, and its choice of the number of clusters k is somewhat blind, which reduces the utility of the clustering results. To further improve the utility of differentially private k-means clustering, this paper proposes a new clustering algorithm based on differential privacy, DPk-means-up, and presents a theoretical analysis and comparative experiments. The theoretical analysis shows that the algorithm satisfies ε-differential privacy and can be applied to data sets of different sizes and dimensions. The experimental results indicate that, at the same level of privacy protection, the proposed algorithm yields more useful clustering results than other differentially private k-means clustering methods.
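For orientation, the traditional baseline that the abstract contrasts against can be sketched as follows. This is a minimal illustration of a generic Laplace-mechanism k-means, not the paper's DPk-means-up algorithm, whose details the abstract does not give. The function names (`laplace`, `dp_kmeans`), the assumption that every coordinate lies in [0, 1], the even split of the privacy budget across iterations, and the random initial centers are all assumptions made for this sketch.

```python
import random


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_kmeans(points, k, epsilon, iterations=5, seed=0):
    """Baseline Laplace-mechanism k-means (illustrative; not DPk-means-up).

    Assumes every coordinate of every point lies in [0, 1].
    """
    rng = random.Random(seed)
    d = len(points[0])
    # Adding or removing one record changes one cluster count by 1 and one
    # cluster sum by at most d in L1 norm, so the per-iteration sensitivity
    # is d + 1. Each iteration gets an equal share epsilon / iterations.
    scale = (d + 1) * iterations / epsilon
    # Data-independent random initial centers consume no privacy budget.
    centers = [[rng.random() for _ in range(d)] for _ in range(k)]
    for _ in range(iterations):
        sums = [[0.0] * d for _ in range(k)]
        counts = [0] * k
        for p in points:
            # Assign the point to its nearest current center.
            j = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            counts[j] += 1
            for i in range(d):
                sums[j][i] += p[i]
        for j in range(k):
            noisy_count = counts[j] + laplace(scale, rng)
            if noisy_count < 1.0:
                continue  # noisy count unusable: keep the previous center
            noisy = [(sums[j][i] + laplace(scale, rng)) / noisy_count
                     for i in range(d)]
            # Clamping to the data domain is post-processing of noisy values.
            centers[j] = [min(1.0, max(0.0, x)) for x in noisy]
    return centers
```

With a small data set or a small ε, the noise dominates and the returned centers are close to arbitrary, which illustrates the utility loss the abstract describes. The sensitivity to initial centers is also visible here: changing `seed` changes the starting points and can change the result substantially.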
Authors
胡闯
杨庚
白云璐
HU Chuang; YANG Geng; BAI Yun-lu (College of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210003, China; Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing 210023, China; College of Information Technology, Nanjing University of Chinese Medicine, Nanjing 210023, China)
Source
《计算机科学》
CSCD
Peking University Core Journals
2019, Issue 2, pp. 120-126 (7 pages)
Computer Science
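Funding
National Natural Science Foundation of China (61572263)
Jiangsu Provincial Natural Science Foundation Policy Guidance Program, Prospective Joint Research Project (2016ZS04)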
Keywords
Differential privacy
k-means
Clustering algorithms
Privacy preserving