
Optimum feature selection based on genetic algorithm under Web spam detection (cited by: 8)
Abstract: To address the high dimensionality and redundancy of features used in Web spam detection, an Improved Feature Selection method Based on Information Gain and Genetic Algorithm (IFS-BIGGA) is proposed. First, features are ranked by importance using Information Gain (IG), and a dynamic threshold is set to remove redundant features. Second, the chromosome encoding function and the selection operator of the Genetic Algorithm (GA) are modified, and the Area Under the receiver operating characteristic Curve (AUC) of a Random Forest (RF) classifier is used as the fitness function to select highly discriminative features. Finally, the number of experimental iterations is increased to offset the randomness of the algorithm, yielding the Optimal Minimum Feature Set (OMFS). Experiments show that, compared with the high-dimensional feature set, the OMFS generated by IFS-BIGGA lowers the AUC under RF by 2% but raises the True Positive Rate (TPR) by 21% and reduces the feature dimension by 92%. The average detection time across several commonly used classifiers falls by 83%. Compared with the Traditional GA (TGA) and the Imperialist Competitive Algorithm (ICA), IFS-BIGGA improves the F1 score under Bayes Net (BN) by 4.2% and 3.5% respectively. These results indicate that IFS-BIGGA reduces feature dimensionality efficiently, which lowers computational cost and improves detection efficiency in practical Web spam detection.
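The two-stage idea in the abstract (an IG filter followed by a GA search) can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation. The abstract does not specify the modified chromosome encoding, the improved selection operator, or the dynamic threshold rule, so this sketch assumes a plain bitmask encoding, tournament selection, and a fixed threshold. The fitness function is a pluggable callable; in the paper's setting it would be the AUC of an RF classifier, whereas `toy_fitness` below is a hypothetical stand-in. The data, function names, and parameters are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """IG of one discrete feature column with respect to the labels."""
    n = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def ig_filter(rows, labels, threshold):
    """Indices of features whose IG exceeds a fixed threshold, best first."""
    scores = [(information_gain([r[j] for r in rows], labels), j)
              for j in range(len(rows[0]))]
    ranked = sorted(scores, key=lambda t: (-t[0], t[1]))
    return [j for s, j in ranked if s > threshold]

def genetic_select(candidates, fitness, pop_size=20, generations=30,
                   mutation_rate=0.1, seed=0):
    """Plain bitmask GA: tournament selection, one-point crossover,
    bit-flip mutation, and elitism (the best mask always survives)."""
    rng = random.Random(seed)
    n = len(candidates)

    def decode(mask):
        return [f for f, bit in zip(candidates, mask) if bit]

    def score(mask):
        return fitness(decode(mask))

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(population, key=score)
    for _ in range(generations):
        next_pop = [best[:]]
        while len(next_pop) < pop_size:
            p1 = max(rng.sample(population, 2), key=score)
            p2 = max(rng.sample(population, 2), key=score)
            cut = rng.randint(1, n - 1) if n > 1 else 0
            child = p1[:cut] + p2[cut:]
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_pop.append(child)
        population = next_pop
        best = max(population, key=score)
    return decode(best)

# Toy data (hypothetical): 4 discrete features, binary spam label.
rows = [(1, 0, 1, 0), (1, 0, 1, 1), (1, 0, 0, 0), (1, 0, 1, 1),
        (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0), (0, 0, 0, 1)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

def toy_fitness(selected):
    """Stand-in for RF AUC: rewards feature 0, penalizes set size."""
    return (1.0 if 0 in selected else 0.0) - 0.1 * len(selected)

kept = ig_filter(rows, labels, threshold=0.05)  # stage 1: IG filter
selected = genetic_select(kept, toy_fitness)    # stage 2: GA search
print(kept, selected)
```

To move closer to the setup described in the abstract, `toy_fitness` would be replaced by a function that trains a classifier on the selected columns and returns its AUC on held-out data, and the search would be repeated over several seeds to reduce the effect of randomness.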
Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2018, No. 1, pp. 295-299 (5 pages)
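Funding: Scientific Research Fund for Reserve Candidates of Sichuan Academic and Technical Leaders (WZ0100112371408, YH1500411031402); Scientific Research Fund for Sichuan Academic and Technical Leaders (WZ0100112371601/004); Sichuan Science and Technology Service Industry Demonstration Project (2016GFW0166)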
Keywords: feature selection; Genetic Algorithm (GA); Information Gain (IG); Random Forest (RF) algorithm; Web spam detection
