摘要
针对网页欺诈检测中特征的高维、冗余问题,提出一个基于信息增益和遗传算法的改进特征选择算法(IFS-BIGGA)。首先,通过信息增益(IG)给出特征重要性排序,设定动态阈值减少冗余特征;其次,改进遗传算法(GA)中染色体编码函数和选择算子,并结合随机森林(RF)的受试者工作特征曲线面积(AUC)作为适应度函数,选择高辨识度特征;最后,增加实验迭代次数避免算法随机性,产生最佳最小的特征集合(OMFS)。实验验证表明,应用IFS-BIGGA生成的OMFS与高维特征集合相比,尽管RF下的AUC减小了2%,但是真阳性率(TPR)提高了21%,并且特征维度减少了92%;同时多个常用分类器的平均检测时间减少了83%;另外,IFS-BIGGA的F1值相比传统的遗传算法(TGA)和帝国主义竞争算法(ICA)分别提高了4.2%和3.5%。实验结果表明,IFS-BIGGA可以进行高效特征降维,在实际的网页检测工程中,有效减少计算代价,提高检测效率。
Focusing on the issue that features used in Web spam detection are always high-dimensional and redundant, an Improved Feature Selection method Based on Information Gain and Genetic Algorithm (IFS-BIGGA) was proposed. Firstly, the priorities of features were ranked by Information Gain (IG), and dynamic threshold was set to get rid of redundant features. Secondly, the function of chromosome encoding was modified and the selection operator was improved in Genetic Algorithm (GA). After that, the Area Under receiver operating Characteristic (AUC) of Random Forest (RF) classifier was utilized as the fitness function to pick up the features with high degree of identification. Finally, the Optimal Minimum Feature Set (OMFS) was obtained by increasing the experimental iteration to avoid the randomness of the proposed algorithm. The experimental results show that OMFS, compared to the high-dimensional feature set, although the AUC under RF is decreased by 2%, the True Positive Rate (TPR) is increased by 21% and the feature dimension is reduced by 92%. And the average detecting time is decreased by 83%; moreover, by comparing to the Traditional GA (TGA) and Imperialist Competitive Algorithm (ICA), the F1 score under Bayes Net (BN) is increased by 4.2% and 3.5% respectively. The experimental results that the IFS-BIGGA can effectively reduce the dimension of features, which means it can effectively reduce the calculation cost, improves the detection efficieney in the actual Web spam detection inspection project.
出处
《计算机应用》
CSCD
北大核心
2018年第1期295-299,共5页
journal of Computer Applications
基金
四川省学术和技术带头人后备人选科研基金资助项目(WZ0100112371408
YH1500411031402)
四川省学术和技术带头人科研基金资助项目(WZ0100112371601/004)
四川省科技服务业示范项目(2016GFW0166)~~
关键词
特征选择
遗传算法
信息增益
随机森林算法
欺诈网页检测
feature selection
Genetic Algorithm (GA)
Information Gain ( IG),
Random Forest (RF) algorithm
Webspare detection