Abstract
Some existing feature selection algorithms do not fully consider the correlation between features and labels, which lowers classification accuracy, and in the ReliefF algorithm a large classification margin between samples can make the classification meaningless. To address these problems, this paper proposes a multilabel feature selection method based on label correlation and an improved ReliefF. First, a label weight is defined according to the proportion of positive samples in the label set, and a formula for the correlation between a feature and the label set is constructed by combining mutual information with the label weights; this formula effectively reflects the correlation between features and the label set and thereby improves the classification accuracy of the algorithm. Second, based on the distance formula of the ReliefF model, the distances from a sample to its nearest-neighbor heterogeneous (different-class) sample and to its nearest-neighbor similar (same-class) sample are computed, and a new sample classification margin is proposed. A new feature weight update formula is then constructed by combining the label weights with this classification margin, which effectively solves the problem in traditional ReliefF that heterogeneous and similar samples become ineffective when the distance between samples is too large. Finally, a new multilabel feature selection algorithm is designed by combining label correlation with the improved ReliefF algorithm. Simulation experiments and analysis on seven multilabel datasets under different evaluation metrics show that the proposed algorithm is effective.
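The first step of the abstract (label weights combined with mutual information) can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: the weight of a label is taken as its share of all positive entries in the label matrix, features are assumed to be discrete, and the relevance of a feature is taken as the label-weighted sum of its mutual information with each label. The function names `label_weights`, `mutual_information`, and `feature_label_relevance` are hypothetical and are not taken from the paper.

```python
import numpy as np


def label_weights(Y):
    """Assumed weighting: each label's share of all positive entries in Y.

    Y is an (n_samples, n_labels) binary matrix.
    """
    pos = np.asarray(Y).sum(axis=0).astype(float)
    total = pos.sum()
    if total == 0:
        # No positive entries at all: fall back to uniform weights.
        return np.full(pos.shape, 1.0 / pos.size)
    return pos / total


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        px = np.mean(x == xv)
        for yv in np.unique(y):
            py = np.mean(y == yv)
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi


def feature_label_relevance(X, Y):
    """Assumed relevance: label-weighted sum of MI(feature, label)."""
    X = np.asarray(X)
    Y = np.asarray(Y)
    w = label_weights(Y)
    scores = []
    for f in range(X.shape[1]):
        score = sum(
            w[j] * mutual_information(X[:, f], Y[:, j])
            for j in range(Y.shape[1])
        )
        scores.append(score)
    return np.array(scores)


# Toy data: feature 0 copies label 0, feature 1 is constant.
X = np.array([[1, 0], [1, 0], [0, 0], [0, 0]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])
relevance = feature_label_relevance(X, Y)
ranking = np.argsort(-relevance)  # most relevant feature first
```

In this toy example the constant feature carries no information about any label, so it ranks below the feature that copies the first label. The improved ReliefF margin and weight update described in the abstract are not sketched here because the abstract does not state their form.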
Authors
SUN Lin (孙林)
DU Wenjuan (杜雯娟)
LI Shuo (李硕)
XU Jiucheng (徐久成)
College of Computer and Information Engineering, Henan Normal University, Xinxiang 453007, China
Source
Journal of Northwest University (Natural Science Edition) (《西北大学学报(自然科学版)》)
CAS
CSCD
PKU Core (北大核心)
2022, No. 5, pp. 834-846 (13 pages)
Funding
National Natural Science Foundation of China (62076089, 61976082)
Science and Technology Research Project of Henan Province (212102210136).