Abstract
Feature selection effectively removes redundant and irrelevant features from high-dimensional data while retaining the important ones, thereby reducing the computational complexity of models and improving their accuracy. During feature selection, noisy data such as outliers and boundary points may degrade classification performance; to deal with them, a feature selection method based on rough sets and density peak clustering is proposed. First, the density peak clustering method is used to remove noisy data and pick out the cluster centers. Then, drawing on the ideas of rough set theory, the data are partitioned according to the cluster centers, and a feature importance measure is defined under the assumption that points in the same cluster should share the same label. Finally, a heuristic feature selection algorithm is designed to select a feature subset that yields a purer cluster structure. Comparative experiments on classification accuracy, number of selected features, and running time are conducted against other algorithms on six UCI datasets, and the results verify the effectiveness and efficiency of the proposed algorithm.
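To make the pipeline described above concrete, the following Python sketch illustrates, under assumptions of our own rather than the paper's exact formulation, the two building blocks named in the abstract: density-peak-based noise filtering with cluster-center selection, and a heuristic forward search driven by a pluggable feature-importance measure. The function names, the parameter defaults (dc_percent, n_centers, noise_quantile), and the evaluate callable are hypothetical; the paper's rough-set-based importance measure would be supplied as evaluate.

```python
import numpy as np


def density_peak_filter(X, dc_percent=2.0, n_centers=3, noise_quantile=0.05):
    """Density-peak-based noise removal and cluster-center selection.

    Returns (keep_idx, center_idx): indices of samples retained as non-noise
    and indices chosen as cluster centers. Parameter defaults are
    illustrative assumptions, not the paper's settings.
    """
    n = X.shape[0]
    # Pairwise Euclidean distances.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Cutoff distance d_c: a small percentile of all pairwise distances.
    dc = np.percentile(dist[np.triu_indices(n, k=1)], dc_percent)
    # Local density rho_i via a Gaussian kernel (self-distance excluded).
    rho = np.exp(-(dist / dc) ** 2).sum(axis=1) - 1.0
    # delta_i: distance to the nearest sample with higher density.
    order = np.argsort(-rho)          # indices sorted by decreasing density
    delta = np.empty(n)
    delta[order[0]] = dist[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        delta[i] = dist[i, order[:rank]].min()
    # Cluster centers: samples with the largest rho * delta product.
    centers = np.argsort(-(rho * delta))[:n_centers]
    # Noise: samples in the lowest-density quantile are dropped.
    keep = np.where(rho > np.quantile(rho, noise_quantile))[0]
    return keep, centers


def greedy_feature_selection(n_features, evaluate):
    """Heuristic forward search: repeatedly add the feature that most
    increases the importance measure evaluate(subset); stop when no
    candidate improves it. `evaluate` stands in for the paper's
    rough-set / cluster-purity based measure."""
    selected, best = [], float("-inf")
    remaining = set(range(n_features))
    while remaining:
        score, f = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best:
            break
        best, selected = score, selected + [f]
        remaining.discard(f)
    return selected
```

On a labeled dataset, one would first call density_peak_filter, then build evaluate from the retained samples and cluster centers (for instance, scoring how consistently the partition induced by a candidate feature subset agrees with the class labels), and finally pass it to greedy_feature_selection.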
Authors
CAO Dongtao, SHU Wenhao, QIAN Jin (School of Information Engineering, East China Jiaotong University, Nanchang 330013, China; School of Software, East China Jiaotong University, Nanchang 330013, China)
Source
Computer Science (《计算机科学》), 2023, No. 10, pp. 37-47 (11 pages); indexed in CSCD and the Peking University Core Journals list
Funding
National Natural Science Foundation of China (62266018, 61966016)
Natural Science Foundation of Jiangxi Province (20202BABL202037, 20232ACB202013, 20232BAB202052)
Jiangxi Provincial Graduate Innovation Fund Project (YC2022-s547)
Keywords
Feature selection
High-dimensional data
Noisy data
Rough sets
Density peak clustering