Purpose: This paper aims to improve classification performance on imbalanced data by applying different sampling techniques available in Machine Learning.
Design/methodology/approach: The medical appointment no-show dataset is imbalanced, and when classification algorithms are applied directly to it, they are biased towards the majority class and ignore the minority class. To avoid this issue, multiple sampling techniques are applied to balance the dataset: Random Over Sampling (ROS), Random Under Sampling (RUS), Synthetic Minority Oversampling TEchnique (SMOTE), ADAptive SYNthetic Sampling (ADASYN), Edited Nearest Neighbor (ENN), and Condensed Nearest Neighbor (CNN). Performance is assessed with a Decision Tree classifier combined with each of the listed sampling techniques, and the best-performing combination is identified.
Findings: This study compares the performance metrics of widely used sampling methods. Compared with the other techniques, Recall is high when ENN is applied; CNN and ADASYN perform equally well on the imbalanced data.
Research limitations: The testing was carried out with a limited dataset and needs to be repeated with a larger dataset.
Practical implications: This framework will be useful whenever the data is imbalanced in real-world scenarios, which ultimately improves performance.
Originality/value: This paper uses the rebalancing framework on the medical appointment no-show dataset to predict no-shows and removes the bias towards the majority class.
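As an illustration of the two simplest techniques named in this abstract, the following is a minimal sketch of Random Over Sampling and Random Under Sampling using only the Python standard library. The function names, the toy data, and the fixed seed are assumptions made for this example and are not taken from the paper; SMOTE, ADASYN, ENN, and CNN are not covered here.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples of each smaller class
    until every class has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))  # sampling with replacement
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class so that every class
    has as many samples as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):  # sampling without replacement
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


# Toy imbalanced data (hypothetical): 8 majority samples, 2 minority samples.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_ros, y_ros = random_oversample(X, y)
X_rus, y_rus = random_undersample(X, y)
print(Counter(y_ros))  # both classes now have 8 samples
print(Counter(y_rus))  # both classes now have 2 samples
```

In practice, resampling of this kind is normally applied to the training split only, so that the test split keeps the original class distribution.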
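Accurate prediction of smart meter faults is important for precise, proactive maintenance of metering equipment and for the stable operation of the power grid. Samples of different fault types occur with different frequencies, and their distributions overlap in the high-dimensional feature space, which greatly increases the difficulty of fault prediction. Existing imbalanced classification methods classify samples by building a mapping between the information of a single sample and its class label, which makes it hard to discriminate samples in overlapping regions that have similar representations and lowers overall classification accuracy. This paper proposes a smart meter fault classification method based on multi-granularity nearest-neighbor graphs. First, a sample from the original dataset is selected as the target sample, and a neighbor graph is built with the target sample and its nearest neighbors as nodes and the links between the target sample and its neighbors as edges. Multi-granularity neighbor graphs are built by varying the number of neighbors selected, which enriches the information available for the target sample and increases the number of training samples, making model training more stable. An encoder is built to extract the node features of the neighbor graph, and a graph attention mechanism adaptively aggregates neighbor information into the target sample according to the encoded node features and the node adjacency, effectively capturing the differences between similar samples. For a given test sample, the classification results of its multi-granularity neighbor graphs are ensembled to obtain a more accurate and more robust fault prediction. Extensive experiments on 20 public imbalanced classification datasets from KEEL (knowledge extraction based on evolutionary learning) and UCI (UC Irvine machine learning repository) and on a real smart meter fault dataset show that, compared with 17 typical methods, the proposed algorithm has a significant advantage in smart meter fault classification.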
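The graph construction step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Euclidean distance, the neighbor counts `ks`, the toy points, and the function names are all hypothetical, and the encoder, graph attention, and ensembling stages are omitted.

```python
import math


def knn_graph(points, target, k):
    """Build one neighbor graph: the target sample and its k nearest
    neighbors are the nodes, and each edge links the target to a neighbor."""
    dists = sorted(
        (math.dist(points[target], p), j)  # Euclidean distance (assumption)
        for j, p in enumerate(points)
        if j != target
    )
    neighbors = [j for _, j in dists[:k]]
    nodes = [target] + neighbors
    edges = [(target, j) for j in neighbors]
    return nodes, edges


def multi_granularity_graphs(points, target, ks=(1, 2, 3)):
    """Build one neighbor graph per neighbor count in ks, so a single
    target sample yields several graphs of different granularity."""
    return {k: knn_graph(points, target, k) for k in ks}


# Toy 2-D samples (hypothetical); index 0 is the target sample.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (5.0, 5.0)]
graphs = multi_granularity_graphs(points, target=0, ks=(1, 2, 3))

for k, (nodes, edges) in graphs.items():
    print(k, nodes, edges)
```

Each target sample produces one graph per value in `ks`, which is how varying the neighbor count multiplies the number of training instances.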
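Multi-label classification tasks are common in real life, but they often suffer from imbalanced data, which seriously degrades classification performance. The mainstream technique for this problem is resampling, which is divided into oversampling and undersampling: oversampling generates samples associated with minority labels, while undersampling removes samples associated with majority labels. However, each of these methods focuses on a single kind of imbalance, either within-label imbalance or between-label imbalance, so resolving one kind may introduce the other. To address this problem, this paper proposes ESUS (Ensemble learning method based on Safe Under-Sampling), an ensemble learning method for imbalanced multi-label data. First, label partitioning divides the imbalanced multi-label dataset into single-label datasets and label-pair datasets. For the single-label datasets, a safe undersampling method is proposed to resolve within-label imbalance, and binary classification models are built on the balanced datasets obtained after sampling. For the label-pair datasets, data pruning is followed by ensemble learning to resolve between-label imbalance, which reduces time and space complexity while maintaining classification performance. Finally, the single-label dataset models and the label-pair dataset models are integrated into the final classification model. Experimental results on six imbalanced multi-label datasets show that, compared with seven baseline methods, ESUS is more stable and effective on four evaluation metrics.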