Abstract
To reduce information loss in data publishing and to address the low data utility of existing clustering-based anonymization schemes, this paper proposes a k-anonymity data publishing algorithm based on hybrid clustering. Unlike traditional single-clustering methods, the algorithm combines density-based clustering with partition-based clustering: it selects the initial cluster centers according to the density characteristics of the data set and then iterates with partition clustering to reach the optimal clustering. The method also removes some of the outlier noise from the data set, reducing its influence on the clustering result. For records with mixed attributes, it adopts a distance measure that combines k-means and k-modes and introduces a bucket generalization algorithm to reduce the information loss caused by generalization. Experimental results show that, compared with existing methods, the proposed algorithm effectively reduces the information loss of data anonymization and improves the quality of the published data.
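The abstract names two building blocks without giving formulas: a distance measure that combines k-means and k-modes, and density-based selection of the initial cluster centers. The sketch below shows one common way these ideas could be realized, not the authors' implementation. The function names `mixed_distance` and `density_seeds`, the weight `gamma`, and the neighborhood `radius` are all assumptions introduced for illustration.

```python
def mixed_distance(a, b, numeric_idx, categorical_idx, gamma=1.0):
    """Distance between two mixed-attribute records (illustrative assumption).

    Numeric attributes contribute a squared Euclidean term (k-means style);
    categorical attributes contribute a mismatch count (k-modes style),
    weighted by the hypothetical parameter gamma.
    """
    numeric_part = sum((a[i] - b[i]) ** 2 for i in numeric_idx)
    categorical_part = sum(1 for i in categorical_idx if a[i] != b[i])
    return numeric_part + gamma * categorical_part


def density_seeds(records, k, radius, dist):
    """Pick up to k initial centers from the densest regions (illustrative).

    A record's density is the number of other records within `radius`.
    Records are visited from densest to sparsest, and a record is accepted
    as a seed only if it lies farther than `radius` from every seed chosen
    so far. Returns the indices of the chosen records.
    """
    density = [
        sum(1 for other in records if dist(r, other) <= radius) - 1  # exclude self
        for r in records
    ]
    order = sorted(range(len(records)), key=lambda i: -density[i])
    seeds = []
    for i in order:
        if all(dist(records[i], records[s]) > radius for s in seeds):
            seeds.append(i)
        if len(seeds) == k:
            break
    return seeds


# Toy records: (age, gender). Index 0 is numeric, index 1 is categorical.
records = [(25, "M"), (26, "M"), (27, "M"), (60, "F"), (61, "F"), (90, "F")]
dist = lambda a, b: mixed_distance(a, b, numeric_idx=[0], categorical_idx=[1])
seeds = density_seeds(records, k=2, radius=5, dist=dist)
```

In this toy data the record `(90, "F")` has no neighbors within the radius, which is the kind of low-density point an outlier-removal step could discard before clustering. In practice the numeric attributes would usually be normalized first so that the squared term does not dominate the mismatch count.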
Authors
方凯
史志才
贾媛媛
FANG Kai; SHI Zhicai; JIA Yuanyuan (School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China; Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai 200240, China)
Source
Electronic Science and Technology (《电子科技》)
2022, No. 12, pp. 78-83 (6 pages)
Funding
National Natural Science Foundation of China (61802252).
Keywords
privacy preserving
data publishing
k-anonymity
clustering
bucket generalization algorithm
mixed attributes
network security
information loss