Funding: Supported in part by the National Key Research and Development Program of China under Grant No. 2020YFB1707900 and the National Natural Science Foundation of China under Grant Nos. U2268204, 62322206, 62132018, 62025204, 62272307, and 62372296.
Abstract: Large quantities of high-quality data are critical to the success of machine learning in diverse applications. Faced with the dilemma of data silos, where data is difficult to circulate, emerging data markets attempt to break the dilemma by facilitating data exchange on the Internet. Crowdsourcing, on the other hand, is one of the important methods for efficiently collecting large amounts of high-value data in data markets. In this paper, we investigate the joint problem of efficient data acquisition and fair budget distribution across crowdsourcing and data markets. We propose a new metric of data value, defined as the reduction in uncertainty of a Bayesian machine learning model obtained by integrating the data into model training. Guided by this data value metric, we design a mechanism called Shapley Value Mechanism with Individual Rationality (SV-IR). It consists of a greedy algorithm with a constant approximation ratio that selects the most cost-efficient data brokers, and a fair compensation determination rule based on the Shapley value that respects the individual rationality constraints. We further propose a fair reward distribution method for the data holders, who exert various effort levels, under the charge of a data broker. We demonstrate the fairness of the compensation determination rule and the reward distribution rule by evaluating our mechanisms on two real-world datasets. The evaluation results also show that the selection algorithm in SV-IR approaches the optimal solution and outperforms state-of-the-art methods.
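The three ingredients named in the abstract (a data value defined as uncertainty reduction, a cost-efficiency greedy selection, and a Shapley-value split) can be illustrated with a toy sketch. This is not the paper's SV-IR mechanism: the value function below assumes a one-dimensional Gaussian model with unit prior precision, and the names `make_value`, `shapley`, `greedy_select`, as well as the broker precisions, costs, and budget, are hypothetical choices made only for illustration. The individual rationality constraint is not modeled here.

```python
from itertools import combinations
from math import factorial, log


def make_value(precision):
    """Toy data value (assumption): entropy reduction of a 1-D Gaussian
    posterior with unit prior precision, where each broker's data adds
    precision[b] to the posterior precision."""
    def value(coalition):
        total = sum(precision[b] for b in coalition)
        return 0.5 * log(1.0 + total)
    return value


def shapley(players, v):
    """Exact Shapley values by enumerating all coalitions (small n only)."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Probability weight of a coalition of size k preceding p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                total += weight * (v(subset + (p,)) - v(subset))
        phi[p] = total
    return phi


def greedy_select(costs, v, budget):
    """Repeatedly consider the broker with the highest marginal value per
    unit cost; keep it if it still fits in the budget, otherwise skip it."""
    chosen, spent = [], 0.0
    remaining = sorted(costs)  # sorted for deterministic tie-breaking
    while remaining:
        base = v(tuple(chosen))
        best = max(
            remaining,
            key=lambda b: (v(tuple(chosen) + (b,)) - base) / costs[b],
        )
        remaining.remove(best)
        if spent + costs[best] <= budget:
            chosen.append(best)
            spent += costs[best]
    return chosen


# Hypothetical brokers: data precision contributed and asking cost.
precision = {"a": 4.0, "b": 1.0, "c": 1.0}
costs = {"a": 3.0, "b": 1.0, "c": 2.0}
value = make_value(precision)

selected = greedy_select(costs, value, budget=4.0)
phi = shapley(list(precision), value)
```

In this toy setting the Shapley values sum to the value of the full coalition, and brokers `b` and `c`, which contribute identical precision, receive identical shares. These are two of the fairness properties usually associated with the Shapley value.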