
Web Spam Detection Based on Integrated Classifier with Bagging-SVM
Abstract: Web spam not only degrades the quality of information retrieval but also poses serious challenges to Internet security. This paper proposes a Web spam detection method based on a Bagging-SVM ensemble classifier. In the preprocessing stage, a K-means-based technique is first used to address the class imbalance of the dataset, then CFS feature selection is applied to pick an optimal feature subset, and finally the selected features are discretized using information entropy. In the classifier training stage, Bagging is used to construct multiple training sets, and an SVM is trained on each of them to produce a weak classifier. In the detection stage, the class of each test sample is decided by a vote among the weak classifiers. Experimental results on the WEBSPAM-UK2006 dataset show that the proposed method achieves very good detection performance while using a relatively small number of features.
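The training and voting stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the SVM kernel or solver, so a linear SVM trained with a Pegasos-style stochastic subgradient method is swapped in here, the preprocessing steps (K-means balancing, CFS, entropy discretization) are omitted, and a synthetic two-cluster toy set stands in for WEBSPAM-UK2006. All function names and parameter values are hypothetical.

```python
import random
from collections import Counter


def train_linear_svm(X, y, lam=0.01, epochs=20, rng=None):
    """Linear SVM (hinge loss + L2 penalty) via Pegasos-style SGD.

    Assumption: labels are +1 / -1. A constant 1.0 is appended to each
    feature vector so the bias is learned as an ordinary weight.
    """
    rng = rng or random.Random(0)
    Xa = [list(x) + [1.0] for x in X]
    w = [0.0] * len(Xa[0])
    order = list(range(len(Xa)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, Xa[i]))
            # Shrink step from the L2 penalty.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Hinge-loss step only when the margin is violated.
            if margin < 1.0:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, Xa[i])]
    return w


def svm_predict(w, x):
    score = sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))
    return 1 if score >= 0 else -1


def train_bagging_svm(X, y, n_estimators=11, seed=0):
    """Train one SVM per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        picks = [rng.randrange(n) for _ in range(n)]
        Xb = [X[i] for i in picks]
        yb = [y[i] for i in picks]
        models.append(train_linear_svm(Xb, yb, rng=rng))
    return models


def bagging_predict(models, x):
    """Majority vote of the weak classifiers (odd count avoids ties)."""
    votes = Counter(svm_predict(w, x) for w in models)
    return votes.most_common(1)[0][0]


def make_toy_data(n_per_class=50, seed=1):
    """Synthetic stand-in data: +1 plays 'spam', -1 plays 'normal'."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((1, 2.0), (-1, -2.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(center, 0.5), rng.gauss(center, 0.5)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    models = train_bagging_svm(X, y)
    correct = sum(bagging_predict(models, x) == t for x, t in zip(X, y))
    print("training accuracy:", correct / len(X))
```

Using an odd number of weak classifiers keeps the binary vote free of ties. On real data the features would first pass through the preprocessing steps the abstract lists.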
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2015, No. 1, pp. 239-243 (5 pages)
Funding: Supported by the Sichuan Province Training Fund for Reserve Candidates of Academic and Technical Leaders (X800912371309)
Keywords: Web spam; Integrated classifier; Feature selection; Information entropy; Weak classifier

