Abstract
Terminal devices such as mobile phones and wearables generate massive amounts of data every day, but these data often involve sensitive private information and cannot be directly released or used. To enable machine learning under privacy protection, federated learning was proposed: it builds a collaborative training mechanism that learns a high-performance global model without sharing client data. In practice, however, existing federated learning mechanisms face two major limitations: (1) the global model must account for the data of many clients, yet each client typically holds only a subset of the classes and the amount of data per class is severely imbalanced, which makes the global model hard to train; (2) the data distributions of different clients usually differ greatly, so the local models also differ greatly, and the traditional approach of obtaining the global model by weighted averaging of model parameters becomes ineffective. To reduce the impact of client-side class imbalance and distribution differences, this paper proposes a Class-Balanced Federated Learning (CBFL) method based on data generation. CBFL uses data generation techniques to construct, for each client, a class-balanced data set suitable for learning the global model. To this end, CBFL designs a class distribution equalizer consisting of a class-balanced sampler and a data generator: the class-balanced sampler samples the classes for which a client lacks data with higher probability, and the data generator then generates corresponding dummy data for the sampled classes, so that the client's class distribution is balanced for subsequent model training. To verify the effectiveness of the proposed method, extensive experiments were conducted on four benchmark datasets. The results show that the proposed method substantially improves federated learning performance: for example, on the CIFAR-100 dataset, the ResNet20 model trained with CBFL improves classification accuracy by 5.82% over existing methods.
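The abstract above refers to the conventional way of forming the global model: element-wise weighted averaging of client model parameters (as in FedAvg). The following is a minimal sketch of that baseline, assuming PyTorch-style state dicts; all names are illustrative and not taken from the paper.

```python
# Minimal sketch of conventional federated parameter averaging (FedAvg-style):
# the server element-wise averages client model parameters, weighted by each
# client's data size. Tensor layout and names are illustrative assumptions.
from typing import Dict, List
import torch


def weighted_average(client_states: List[Dict[str, torch.Tensor]],
                     client_sizes: List[int]) -> Dict[str, torch.Tensor]:
    """Element-wise weighted average of client model parameters."""
    total = float(sum(client_sizes))
    weights = [n / total for n in client_sizes]
    # Average every parameter tensor across clients, weighted by data size.
    return {
        name: sum(w * state[name].float() for w, state in zip(weights, client_states))
        for name in client_states[0]
    }
```

When the class distributions on clients differ sharply, the locally trained parameters diverge and this element-wise average degrades, which is the limitation that motivates CBFL.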
Modern terminal devices such as mobile phones and wearable devices produce massive amounts of data every day, but these data often involve sensitive privacy and thus cannot be directly disclosed and used. To solve this problem, Federated Learning (FL) has been developed as an important machine learning framework under privacy protection, which allows extensive terminal devices/clients to collaboratively learn a superior global model without sharing the private data on the clients. However, in practical applications, there are still two underlying limitations to existing FL mechanisms. First, the global model needs to consider the data on multiple clients, but each client usually contains only partial classes of data and the amount of data in different classes is severely imbalanced, making it difficult to train the global model. Specifically, most data on a client belong to a few classes, while other classes have few or no data. As a result, the trained local models tend to overfit the data on the clients and achieve poor performance on global data, which severely affects the training of the global model. Second, the data distribution is extremely different across clients, which causes the trained models on the clients to be quite different, making it hard to derive a promising global model. In fact, the training data on each client usually come from the usage of the terminal device by a particular user. Due to differences in the functions of terminal devices and the usage habits of users, different clients often produce different classes of data, leading to extremely different class distributions across the data on the clients. Consequently, there will be huge differences among the local models trained on such distributions, making it difficult to obtain a superior global model through the traditional approach of element-wise weighted averaging of model parameters. To reduce the impact of class imbalance and distribution differences, in this paper we propose a novel Class-Balanced Federated Learning (CBFL) method based on data generation, which aims to produce a class-balanced data set suitable for the training of the global model for each client through a data generation technique. To this end, CBFL designs a class distribution equalizer that consists of a class-balanced sampler and a data generator. First, the class-balanced sampler samples the classes that have insufficient data on the client with a higher probability. Then, the data generator generates corresponding dummy data according to the classes sampled by the class-balanced sampler. Finally, each client combines its original data and the generated data to produce a class-balanced data set for training. In this way, the performance of each local model can be greatly improved and the differences among local models are highly reduced, which contributes to obtaining a promising global model. Moreover, to obtain high-quality generated data, we exploit global data distribution information from the global model to train the data generator. Extensive experiments on four benchmark datasets demonstrate the superior performance of the proposed method over existing methods. For example, the ResNet20 model trained on the CIFAR-100 dataset by the proposed CBFL outperforms existing methods by 5.82% in terms of accuracy.
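To make the class distribution equalizer concrete, the following is a minimal sketch of the two components the abstract describes: a class-balanced sampler that picks under-represented classes with higher probability, and a class-conditional generator call that produces dummy data for the sampled labels. The generator interface (generator(labels) -> images) and all names here are assumptions for illustration, not the authors' implementation.

```python
# Sketch of the class distribution equalizer: under-represented classes are
# sampled with higher probability, and a conditional generator synthesizes
# dummy samples for the sampled labels. All interfaces are illustrative.
import numpy as np
import torch


def class_sampling_probs(class_counts: np.ndarray) -> np.ndarray:
    """Assign higher sampling probability to classes with fewer local samples."""
    deficit = class_counts.max() - class_counts      # gap to the largest class
    if deficit.sum() == 0:                           # already balanced: uniform
        return np.full(len(class_counts), 1.0 / len(class_counts))
    return deficit / deficit.sum()


def generate_balancing_data(class_counts: np.ndarray, generator, num_dummy: int):
    """Sample scarce classes and synthesize dummy data for them."""
    probs = class_sampling_probs(class_counts)
    labels = np.random.choice(len(class_counts), size=num_dummy, p=probs)
    labels = torch.as_tensor(labels, dtype=torch.long)
    dummy_images = generator(labels)                 # class-conditional generation
    return dummy_images, labels
```

In CBFL, each client would then train on its real data combined with the generated (dummy_images, labels) pairs, yielding the class-balanced training set the abstract describes; per the abstract, the generator itself is trained with global distribution information extracted from the global model, which is not shown in this sketch.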
Authors
李志鹏
国雍
陈耀佛
王耀威
曾炜
谭明奎
LI Zhi-Peng; GUO Yong; CHEN Yao-Fo; WANG Yao-Wei; ZENG Wei; TAN Ming-Kui (School of Software Engineering, South China University of Technology, Guangzhou 510006; Artificial Intelligence Research Center, Peng Cheng Laboratory, Shenzhen, Guangdong 518054; School of Electronics Engineering and Computer Science, Peking University, Beijing 100871)
Source
《计算机学报》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2023, No. 3, pp. 609-625 (17 pages)
Chinese Journal of Computers
Funding
Young Scientists Project of the Ministry of Science and Technology of China (2020AAA0106900)
Joint Funds of the National Natural Science Foundation of China (U20B2052)
National Natural Science Foundation of China (62072190)
Key-Area Research and Development Program of Guangdong Province (2018B010107001)
Pearl River Talent Program of Guangdong Province, Innovative and Entrepreneurial Team Project (2017ZT07X183)
Keywords
federated learning
data generation
class distribution
class imbalance
privacy protection