Web spam detection based on immune clonal feature selection and under-sampling ensemble

Cited by: 3
Abstract: To address the "curse of dimensionality" and the class-imbalance problem in Web spam detection, a binary classifier algorithm based on immune clonal feature selection and an Under-Sampling (US) ensemble is proposed. First, under-sampling draws several sample subsets from the majority class of the training set, each close in size to the minority class, and each is merged with the minority-class samples to form a balanced training subset. Next, an immune clonal algorithm selects several optimal feature subsets, and the balanced training subsets are projected onto these feature subsets to produce multiple views of the balanced data. Finally, Random Forest (RF) classifiers trained on these views classify the test samples, and simple voting determines each test sample's final class. Experiments on the WEBSPAM UK-2006 dataset show that, for Web spam detection, the ensemble classifier improves accuracy, F1-measure, and AUC (Area Under the ROC Curve) by more than 11% each over RF alone and over Bagging and AdaBoost ensembles of RF. Compared with the best previously reported results, it improves the F1-measure by 2% and achieves the best AUC.
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2016, No. 7, pp. 1899-1903 (5 pages)
Funding: Science and Technology Support Program of Jiangxi Province (20131102040039)
Keywords: Web spam detection; ensemble learning; immune clonal algorithm; feature selection; Under-Sampling (US); Random Forest (RF)
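
The pipeline outlined in the abstract (under-sample the majority class into balanced subsets, project each subset onto a selected feature subset, train one classifier per view, then vote) can be sketched roughly as follows. This is only an illustration under several assumptions, not the authors' implementation: the feature subsets are passed in by hand rather than chosen by an immune clonal algorithm, each balanced subset is paired with exactly one feature subset, and a tiny nearest-centroid learner stands in for the Random Forest so that the example needs only the standard library. All function and class names here are hypothetical.

```python
import random
from collections import Counter


def balanced_subsets(X, y, n_subsets, rng):
    """Under-sample the majority class into n_subsets balanced training sets."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    minor_idx = [i for i, label in enumerate(y) if label == minority]
    major_idx = [i for i, label in enumerate(y) if label != minority]
    subsets = []
    for _ in range(n_subsets):
        # Draw as many majority samples as there are minority samples.
        sampled = rng.sample(major_idx, len(minor_idx))
        idx = sampled + minor_idx
        subsets.append(([X[i] for i in idx], [y[i] for i in idx]))
    return subsets


def project(rows, features):
    """Keep only the selected feature columns (one 'view' of the data)."""
    return [[row[j] for j in features] for row in rows]


class NearestCentroid:
    """Stand-in base learner (assumption: NOT a random forest)."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, value in enumerate(row):
                acc[j] += value
        self.centroids = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}
        return self

    def predict_one(self, row):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(row, self.centroids[c]))
        return min(self.centroids, key=sq_dist)


def train_ensemble(X, y, feature_subsets, seed=0, make_learner=NearestCentroid):
    """Train one learner per (balanced subset, feature subset) pair."""
    rng = random.Random(seed)
    subsets = balanced_subsets(X, y, len(feature_subsets), rng)
    members = []
    for (Xs, ys), feats in zip(subsets, feature_subsets):
        members.append((feats, make_learner().fit(project(Xs, feats), ys)))
    return members


def predict(members, row):
    """Simple majority vote over the ensemble members."""
    votes = Counter(model.predict_one([row[j] for j in feats]) for feats, model in members)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data: 12 "normal" pages (label 0) and 3 "spam" pages (label 1).
    X = [[0.1 * i, 0.2 * i, float(i % 3)] for i in range(12)]
    X += [[5.0 + 0.1 * i, 6.0, float(i)] for i in range(3)]
    y = [0] * 12 + [1] * 3
    members = train_ensemble(X, y, feature_subsets=[[0, 1], [0, 2], [1, 2]])
    print(predict(members, [5.2, 6.1, 1.0]))
```

Using an odd number of ensemble members avoids tied votes in the two-class case. To move closer to the setup the abstract describes, `make_learner` could be replaced by a factory for a real random forest implementation exposing the same `fit` / `predict_one` interface, which is itself an assumption of this sketch.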