Abstract
Online streaming feature selection reduces the dimensionality of a streaming feature space by filtering out irrelevant and redundant features in real time. Existing algorithms have drawbacks: Alpha-investing has low classification accuracy, SAOLA selects a large number of features, and Online Streaming Feature Selection (OSFS) has a long running time on datasets with low redundancy and high relevance. To address these problems, we propose a classification-oriented online feature selection algorithm for streaming features, named OSFIC. OSFIC uses a four-layer filtering framework: it filters out irrelevant new features by unconditional (null-conditional) independence, filters out redundant new features and some of the redundant features in the candidate feature set by mutual information under a single condition, and finally filters out the remaining redundant features in the candidate feature set by multi-conditional independence, yielding an approximate Markov blanket of the class label. To analyze the performance of OSFIC, we selected datasets from NIPS 2003 and the Causality Workbench and compared OSFIC with existing baseline algorithms in terms of prediction accuracy, number of selected features, running time, and AUC. Experiments show that the average classification accuracy of OSFIC is 4.41% higher than that of Alpha-investing. While maintaining accuracy, OSFIC selects 41.9% fewer features on average than SAOLA, and its running time is 91.59% lower than that of OSFS. Finally, the effectiveness of OSFIC is verified in a real application scenario.
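The four-layer filtering idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`cond_mutual_info`, `select_streaming`), the threshold `delta`, the cap `max_cond` on the conditioning-set size, and the use of a plain empirical (conditional) mutual information threshold in place of a statistical independence test are all assumptions made for the example, and features are assumed to be discrete.

```python
import math
from collections import Counter
from itertools import combinations


def cond_mutual_info(x, y, cond=()):
    """Empirical I(X;Y|Z) in nats for discrete data; `cond` is a tuple of columns."""
    n = len(x)
    z = list(zip(*cond)) if cond else [()] * n
    n_xyz = Counter(zip(x, y, z))
    n_xz = Counter(zip(x, z))
    n_yz = Counter(zip(y, z))
    n_z = Counter(z)
    return sum(
        (c / n) * math.log(c * n_z[zi] / (n_xz[(xi, zi)] * n_yz[(yi, zi)]))
        for (xi, yi, zi), c in n_xyz.items()
    )


def select_streaming(stream, labels, delta=0.01, max_cond=2):
    """Hypothetical four-layer filter over (name, column) pairs arriving one by one."""
    selected = {}  # candidate feature set: name -> column
    for name, col in stream:
        # Layer 1: discard a new feature that is unconditionally independent of the label.
        if cond_mutual_info(col, labels) <= delta:
            continue
        # Layer 2: discard a new feature made redundant by a single candidate.
        if any(cond_mutual_info(col, labels, (s,)) <= delta for s in selected.values()):
            continue
        # Layer 3: discard candidates made redundant by the new feature alone.
        selected = {k: s for k, s in selected.items()
                    if cond_mutual_info(s, labels, (col,)) > delta}
        selected[name] = col
        # Layer 4: discard candidates made redundant by larger subsets of the others.
        for k in list(selected):
            others = [v for j, v in selected.items() if j != k]
            if any(cond_mutual_info(selected[k], labels, z) <= delta
                   for size in range(2, max_cond + 1)
                   for z in combinations(others, size)):
                del selected[k]
    return list(selected)


labels = [0, 0, 1, 1, 0, 0, 1, 1]
stream = [
    ("noise", [0, 1, 0, 1, 0, 1, 0, 1]),  # independent of the label
    ("c", [0, 0, 1, 1, 0, 0, 1, 0]),      # noisy copy of the label
    ("a", list(labels)),                  # exact copy of the label
    ("b", list(labels)),                  # duplicate of "a"
]
print(select_streaming(stream, labels))  # ['a']
```

In this toy stream, "noise" is dropped by the first layer, "c" is accepted and later removed by the third layer once "a" arrives, and "b" is dropped by the second layer. A fixed threshold on empirical mutual information is only a stand-in here; with finite samples a proper independence test would normally be used.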
Authors
尤殿龙
郭松
赵春慧
原福永
申利民
陈真
YOU Dian-long; GUO Song; ZHAO Chun-hui; YUAN Fu-yong; SHEN Li-min; CHEN Zhen (School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei 066004, China; The Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province, Qinhuangdao, Hebei 066004, China)
Source
《电子学报》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2020, No. 2, pp. 321-332 (12 pages)
Acta Electronica Sinica
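Funding
National Natural Science Foundation of China (No.61772450)
China Postdoctoral Science Foundation (No.2018M631764)
Natural Science Foundation of Hebei Province (No.F2019203287, No.F2017203307)
Hebei Province Science and Technology Program (No.17210701D)
Hebei Province Postdoctoral Research Project (No.B2018003009)
Scientific Research Program of the Hebei Provincial Department of Education (No.KCJSX2017028)
Yanshan University Basic Research Special Project (No.16SKY011)
Yanshan University Doctoral Foundation (No.BL18003).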
Keywords
online feature selection
streaming feature
mutual information
conditional independence
approximate Markov blanket