In real-world classification tasks, imbalanced data usually follow a nonlinear distribution, and traditional sampling methods handle such nonlinearity poorly, which degrades classification performance. To address this problem, this paper proposes an under-sampling method based on kernel principal component analysis (KPCA). The method uses a nonlinear kernel function to map the original data into a suitable high-dimensional space in which they become linearly separable, and then selectively removes redundant majority-class samples according to each sample's score on the kernel principal components, thereby achieving under-sampling. Nine datasets with different balance rates were preprocessed with the proposed under-sampling method and then classified with a Logistic Regression classifier. The experimental results show that the proposed method obtained the highest Accuracy, F1-measure, and AUC on 7, 8, and 9 of the datasets, respectively, which indicates good classification performance on imbalanced datasets.
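The selection step can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: an RBF kernel with a hypothetical `gamma`, use of only the first kernel principal component (found here by power iteration), and a keep rule that sorts majority-class samples by score and retains evenly spaced ranks, so that samples with neighbouring scores are treated as redundant. All function names are illustrative and are not the authors'.

```python
import math
import random

def rbf_kernel_matrix(X, gamma):
    """Pairwise RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            d2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = K[j][i] = math.exp(-gamma * d2)
    return K

def center_kernel(K):
    """Double-center K so the mapped data have zero mean in feature space."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]  # equals column mean, K is symmetric
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]

def top_component_scores(Kc, iters=200, seed=0):
    """Scores on the first kernel PC, up to a positive scale factor."""
    n = len(Kc)
    rng = random.Random(seed)
    v = [rng.random() - 0.5 for _ in range(n)]
    for _ in range(iters):  # power iteration for the leading eigenvector
        w = [sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n)]

def kpca_undersample(majority, n_keep, gamma=1.0):
    """Keep n_keep majority samples spread evenly along the first kernel PC.

    Assumed selection rule (not taken from the paper): evenly spaced ranks.
    """
    if n_keep <= 0:
        return []
    if n_keep >= len(majority):
        return list(majority)
    Kc = center_kernel(rbf_kernel_matrix(majority, gamma))
    scores = top_component_scores(Kc)
    order = sorted(range(len(majority)), key=lambda i: scores[i])
    step = len(order) / n_keep  # step > 1, so the picked ranks are distinct
    picked = [order[int(k * step)] for k in range(n_keep)]
    return [majority[i] for i in sorted(picked)]

# Synthetic majority class, for illustration only.
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
kept = kpca_undersample(majority, n_keep=10)
print(len(kept), "of", len(majority), "majority samples kept")
```

The reduced majority class would then be combined with the untouched minority class before training the classifier. A library implementation of KPCA would normally replace the hand-written power iteration.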
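BIRCH is a clustering algorithm suited to large-scale numerical data, but real-world data are often of mixed type, which limits its applicability. In addition, clustering analysis with BIRCH carries a risk of privacy leakage, and traditional centralized differential privacy algorithms have the drawback of requiring a trusted third party. To address these shortcomings, a BIRCH algorithm for mixed data based on local differential privacy (LDP-BIRCH) is proposed. It encodes the non-numerical attributes of the mixed data, perturbs the dataset using local differential privacy, and sends the perturbed dataset to a third party for BIRCH clustering analysis. The results show that LDP-BIRCH satisfies both privacy protection and clustering utility on the adult and Facebook Live Sellers in Thailand datasets.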
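The encode-then-perturb step that runs on each user's side can be sketched as follows. The abstract does not name the mechanisms, so this sketch assumes k-ary randomized response for categorical attributes, Laplace noise for numerical attributes rescaled to [0, 1], an even split of the privacy budget across attributes, and one-hot encoding of the reported category. The schema format and all function names are hypothetical. The BIRCH clustering performed by the third party is not shown.

```python
import math
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def perturb_numeric(x, lo, hi, eps, rng):
    """Clip and rescale x to [0, 1] (sensitivity 1), then add Laplace(1/eps)."""
    unit = (min(max(x, lo), hi) - lo) / (hi - lo)
    return unit + laplace_noise(1.0 / eps, rng)

def perturb_category(code, k, eps, rng):
    """k-ary randomized response over integer codes 0..k-1."""
    p_true = math.exp(eps) / (math.exp(eps) + k - 1)
    if rng.random() < p_true:
        return code                      # report the true code
    other = rng.randrange(k - 1)         # uniform over the k-1 other codes
    return other if other < code else other + 1

def one_hot(code, k):
    """Numeric encoding so a numeric clustering algorithm can consume it."""
    return [1.0 if i == code else 0.0 for i in range(k)]

def perturb_record(record, schema, eps, rng):
    """Perturb one mixed-type record locally, before it leaves the user.

    schema (hypothetical format): ("num", lo, hi) or ("cat", [categories]).
    Assumption: eps is split evenly across attributes (sequential composition).
    """
    eps_attr = eps / len(schema)
    out = []
    for value, spec in zip(record, schema):
        if spec[0] == "num":
            out.append(perturb_numeric(value, spec[1], spec[2], eps_attr, rng))
        else:
            cats = spec[1]
            code = perturb_category(cats.index(value), len(cats), eps_attr, rng)
            out.extend(one_hot(code, len(cats)))
    return out

# Made-up record and schema, for illustration only.
rng = random.Random(0)
schema = [("num", 0.0, 100.0), ("cat", ["a", "b", "c"])]
noisy = perturb_record([39.0, "b"], schema, eps=2.0, rng=rng)
print(noisy)
```

Because every record is randomized before transmission, the party running the clustering never sees raw values, which is what removes the need for a trusted third party.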