Web Spam Detection Based on Integrated Classifier with Bagging-SVM

Abstract: Web spam not only degrades the quality of information retrieval but also poses a serious challenge to Internet security. This paper proposes a Web spam detection method based on a Bagging-SVM ensemble classifier. In the preprocessing stage, a K-means-based technique is first applied to address the class imbalance of the dataset, the CFS feature selection method is then used to select an optimal feature subset, and finally the selected features are discretized using information entropy. In the classifier training stage, Bagging is used to construct multiple training sets, and an SVM is trained on each of them to produce a weak classifier. In the detection stage, the class of each test sample is decided by a vote among the weak classifiers. Experimental results on the WEBSPAM-UK2006 dataset show that the proposed method achieves very good detection performance while using a relatively small number of features.
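The training and detection stages described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear SVM trained by Pegasos-style subgradient descent stands in for whatever SVM variant and kernel the paper used, the preprocessing steps (K-means rebalancing, CFS, entropy discretization) are omitted, and the function names, hyperparameters, and toy data are all hypothetical.

```python
import random
from collections import Counter


def train_linear_svm(X, y, lam=0.01, epochs=20, rng=None):
    """Pegasos-style subgradient descent on the hinge loss (labels are +1/-1)."""
    rng = rng or random.Random(0)
    w = [0.0] * (len(X[0]) + 1)              # last weight acts as the bias
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):   # one shuffled pass
            t += 1
            eta = 1.0 / (lam * t)            # decaying step size
            xi = list(X[i]) + [1.0]          # append constant bias feature
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, xi))
            shrink = 1.0 - 1.0 / t           # equals 1 - eta * lam
            w = [shrink * wj for wj in w]
            if margin < 1.0:                 # sample violates the margin
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, xi)]
    return w


def svm_predict(w, x):
    """Sign of the linear score, with ties resolved as +1."""
    score = sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))
    return 1 if score >= 0 else -1


def bagging_svm_fit(X, y, n_estimators=5, seed=0):
    """Train one SVM per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(
            train_linear_svm([X[i] for i in idx], [y[i] for i in idx], rng=rng)
        )
    return models


def bagging_svm_predict(models, x):
    """Majority vote over the weak classifiers."""
    votes = Counter(svm_predict(w, x) for w in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: +1 = spam, -1 = normal; two made-up features per page.
X = [(2.0, 3.0), (3.0, 2.0), (4.0, 4.0), (2.5, 2.5),
     (-2.0, -3.0), (-3.0, -2.0), (-4.0, -4.0), (-2.5, -2.5)]
y = [1, 1, 1, 1, -1, -1, -1, -1]

models = bagging_svm_fit(X, y, n_estimators=5)
print(bagging_svm_predict(models, (3.0, 3.0)),
      bagging_svm_predict(models, (-3.0, -3.0)))
```

An odd number of weak classifiers is used so that a two-class vote cannot tie. In practice the rebalanced, feature-selected, and discretized WEBSPAM-UK2006 features would replace the toy `X` and `y`.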

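Authors: TANG Shouhong (唐寿洪), ZHU Yan (朱焱), YANG Fan (杨凡)

Affiliation: School of Information Science and Technology, Southwest Jiaotong University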

Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2015, No. 1, pp. 239-243 (5 pages)

Funding: Supported by the Sichuan Province Training Fund for Reserve Candidates of Academic and Technical Leaders (X800912371309)

Keywords: Web spam; ensemble classifier; feature selection; information entropy; weak classifier

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]
