Funding: This work was supported by the National Natural Science Foundation of China (No. 32101612, No. 61871283).
Abstract: The ongoing data explosion has introduced unprecedented challenges to the information security of communication networks. As images are among the most commonly used carriers for information transmission, the analysis and screening of their data redundancy are of great significance. However, most current research focuses on improving algorithms over commonly used image datasets. Thus, an important question should be considered: is there data redundancy in open datasets? Taking model structure and data distribution into account to ensure generalization, we conducted extensive experiments comparing the average accuracy obtained from a small random subset of the data with the baseline accuracy obtained from all the data. The results show serious data redundancy in open datasets from different domains. For instance, with a deep model, only 20% of the data can achieve more than 90% of the baseline accuracy. Further, we propose a novel entropy-based information screening method, which outperforms random sampling under many experimental conditions. In particular, with 20% of the data, the improvement is approximately 10% for the shallow model, and for the deep model the ratio to the baseline accuracy increases to more than 95%. Moreover, this work can also serve as a new way of learning from a few valuable samples, compressing the size of existing datasets, and guiding the construction of high-quality datasets in the future.
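As a rough illustration of what entropy-based screening of an image dataset can look like, the minimal Python sketch below scores each image by the Shannon entropy of its grayscale intensity histogram and keeps only the highest-entropy fraction. The scoring function, histogram binning, and the 20% keep ratio are illustrative assumptions, not the exact method proposed in the paper.

```python
import numpy as np

def image_entropy(img: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy of an image's grayscale intensity histogram (illustrative score)."""
    hist, _ = np.histogram(img.ravel(), bins=bins, range=(0, 255), density=True)
    hist = hist[hist > 0]                      # drop empty bins to avoid log(0)
    return float(-np.sum(hist * np.log2(hist)))

def entropy_screen(images: list[np.ndarray], keep_ratio: float = 0.2) -> list[int]:
    """Return indices of the highest-entropy images, keeping `keep_ratio` of the dataset."""
    scores = np.array([image_entropy(img) for img in images])
    k = max(1, int(keep_ratio * len(images)))
    # Indices sorted by descending entropy; the top-k are the retained "informative" samples.
    return np.argsort(scores)[::-1][:k].tolist()

# Toy usage: keep the most informative 20% of 100 random 32x32 images.
rng = np.random.default_rng(0)
toy_images = [rng.integers(0, 256, size=(32, 32), dtype=np.uint8) for _ in range(100)]
selected = entropy_screen(toy_images, keep_ratio=0.2)
print(f"Selected {len(selected)} of {len(toy_images)} images")
```

In practice the retained subset would then be used to train the shallow or deep model, and its accuracy compared against the all-data baseline, mirroring the comparison described in the abstract.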