Abstract
Aiming at the inability of existing feature selection methods to measure the overlap/separation between different classes of data, an Interference Entropy of Two-class Distinguishing (IET-CD) method was proposed to evaluate the two-class distinguishing ability of features. For a feature containing samples of two classes (positive and negative), firstly, the mixed conditional probability of the negative-class samples falling within the range of the positive-class data and the probability of the negative-class samples belonging to the positive class were calculated; then, the confusion probability was computed from the mixed conditional probability and the attribution probability, and the confusion probability was used to compute the positive-class interference entropy; the negative-class interference entropy was computed in the same way; finally, the sum of the positive and negative interference entropies was taken as the two-class interference entropy of the feature. The interference entropy evaluates how well a feature distinguishes the two classes of samples: the smaller the interference entropy of a feature, the stronger its two-class distinguishing ability, and vice versa. On three UCI datasets and one simulated gene expression dataset, five optimal features were selected by each method, and the two-class distinguishing ability of these features was compared in order to compare the performance of the methods. The experimental results show that the proposed method is comparable to or better than the NEFS (Neighborhood Entropy Feature Selection) method, and outperforms the Single-indexed Neighborhood Entropy Feature Selection (SNEFS) method, the Max-Relevance and Min-Redundancy (MRMR) feature selection algorithm, the Joint Mutual Information (JMI) method and the Relief method in most cases. The IET-CD method can effectively select features with better two-class distinguishing ability.
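The abstract does not give the explicit formulas of IET-CD, so the step-by-step description above can only be illustrated with a minimal Python sketch under stated assumptions; it is not the authors' implementation. In particular, the "range of positive-class data" is read here as the [min, max] interval of the positive values on the feature, the attribution probability is modelled with a hypothetical Gaussian fit to the reference class, the confusion probability is assumed to be the product of the two probabilities, and the interference entropy is taken as the binary Shannon entropy of the confusion probability. The function names (one_side_interference_entropy, iet_cd, top5_features) are illustrative only.

import numpy as np

def one_side_interference_entropy(x_ref, x_other):
    # Interference of `other`-class samples intruding into the `ref`-class range.
    # A sketch of one plausible reading of the abstract; the probability
    # definitions below are assumptions, not the published IET-CD formulas.
    lo, hi = x_ref.min(), x_ref.max()                   # assumed "range" = [min, max]
    intruders = x_other[(x_other >= lo) & (x_other <= hi)]
    if intruders.size == 0:                             # no overlap: no interference
        return 0.0
    # Mixed conditional probability (assumption): fraction of samples inside the
    # reference range that come from the other class.
    p_mix = intruders.size / (intruders.size + x_ref.size)
    # Attribution probability (assumption): how strongly the intruding samples
    # resemble reference-class samples, via an unnormalized Gaussian fit.
    mu, sigma = x_ref.mean(), x_ref.std() + 1e-12
    p_attr = float(np.mean(np.exp(-0.5 * ((intruders - mu) / sigma) ** 2)))
    # Confusion probability (assumed product form) and its binary entropy.
    p = np.clip(p_mix * p_attr, 1e-12, 1.0 - 1e-12)
    return float(-(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p)))

def iet_cd(x_pos, x_neg):
    # Two-class interference entropy of one feature: positive-side plus
    # negative-side interference. Smaller values indicate better separation.
    return (one_side_interference_entropy(x_pos, x_neg)
            + one_side_interference_entropy(x_neg, x_pos))

def top5_features(X, y):
    # X: (n_samples, n_features) array; y: binary label array with values in {0, 1}.
    # Score every feature and keep the five with the smallest interference
    # entropy, matching the selection protocol described in the abstract.
    scores = np.array([iet_cd(X[y == 1, j], X[y == 0, j]) for j in range(X.shape[1])])
    return np.argsort(scores)[:5]

Under this reading, features are ranked by iet_cd in ascending order and the five lowest-scoring features are retained for the comparison of two-class distinguishing ability.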
Authors
ZENG Yuanpeng; WANG Kaijun; LIN Song (College of Mathematics and Informatics, Fujian Normal University, Fuzhou, Fujian 350007, China; Digit Fujian Internet-of-Things Laboratory of Environmental Monitoring, Fujian Normal University, Fuzhou, Fujian 350007, China)
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journals (北大核心)
2020, Issue 3, pp. 626-630 (5 pages)
Funding
National Natural Science Foundation of China (61672157, 61772134)
Natural Science Foundation of Fujian Province (2018J01778)
China Postdoctoral Science Foundation (2016M600494)
Keywords
feature selection
two-class distinguishing ability
conditional probability
interference entropy