

Web Spam Detection by the Genetic Programming-based Ensemble Learning
Abstract Web spam detection is one of the major challenges facing search engines. This paper proposes a genetic programming-based ensemble learning approach (GPENL) to detect web spam. The method first draws t different training sets from the original training set by under-sampling. It then trains c different classification algorithms on the t training sets to obtain t*c base classifiers. Finally, genetic programming is used to learn how the t*c base classifiers should be combined. The new method not only merges under-sampling with ensemble learning to improve classification performance on imbalanced datasets, but also makes it easy to integrate base classifiers of different types. Experiments on the WEBSPAM-UK2006 dataset show that GPENL improves classification performance for both homogeneous and heterogeneous ensembles, that in most cases heterogeneous ensembles work better than homogeneous ones, and that GPENL achieves a higher F-measure than AdaBoost, Bagging, RandomForest, majority-vote ensembles (Vote), the EDKC algorithm, and the method based on Prediction Spamicity.
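The three steps in the abstract (under-sample into t sets, train c algorithm types, combine the t*c classifiers with genetic programming) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The two toy learners (`train_stump`, `train_centroid`), the Boolean operator set (`and`, `or`, `not`), the population settings, and the synthetic data are all assumptions made for illustration. The abstract does not specify the paper's base algorithms, GP function set, or parameters. For brevity the fitness (F-measure) is computed on the same data used for training, where a held-out set would normally be used.

```python
import random

def under_sample(data, t, rng):
    """Return t balanced sets: every spam row plus an equal-sized random draw of normal rows."""
    spam = [r for r in data if r[1] == 1]
    normal = [r for r in data if r[1] == 0]
    return [spam + rng.sample(normal, len(spam)) for _ in range(t)]

def train_stump(rows, feature=0):
    """Toy learner 1 (hypothetical): threshold one feature midway between the class means."""
    def mean(label):
        vals = [x[feature] for x, y in rows if y == label]
        return sum(vals) / len(vals)
    m0, m1 = mean(0), mean(1)
    cut, sign = (m0 + m1) / 2, (1 if m1 >= m0 else -1)
    return lambda x: 1 if sign * (x[feature] - cut) > 0 else 0

def train_centroid(rows):
    """Toy learner 2 (hypothetical): predict the class whose centroid is nearer."""
    def centroid(label):
        pts = [x for x, y in rows if y == label]
        return [sum(col) / len(pts) for col in zip(*pts)]
    c0, c1 = centroid(0), centroid(1)
    dist = lambda x, c: sum((a - b) ** 2 for a, b in zip(x, c))
    return lambda x: 1 if dist(x, c1) < dist(x, c0) else 0

def f_measure(pred, truth):
    """F1 = 2TP / (2TP + FP + FN), with spam (label 1) as the positive class."""
    tp = sum(1 for p, y in zip(pred, truth) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, truth) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, truth) if p == 0 and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def random_tree(n, depth, rng):
    """Random expression over base outputs: ('clf', i), ('not', a), ('and'|'or', a, b)."""
    if depth == 0 or rng.random() < 0.3:
        return ("clf", rng.randrange(n))
    op = rng.choice(["and", "or", "not"])
    kids = 1 if op == "not" else 2
    return (op,) + tuple(random_tree(n, depth - 1, rng) for _ in range(kids))

def evaluate(tree, votes):
    """Apply a combiner tree to one sample's vector of base-classifier votes (0/1)."""
    if tree[0] == "clf":
        return votes[tree[1]]
    if tree[0] == "not":
        return 1 - evaluate(tree[1], votes)
    a, b = evaluate(tree[1], votes), evaluate(tree[2], votes)
    return a & b if tree[0] == "and" else a | b

def subtrees(tree):
    yield tree
    if tree[0] != "clf":
        for child in tree[1:]:
            yield from subtrees(child)

def graft(tree, donor, rng):
    """Replace one randomly chosen subtree of `tree` with `donor`."""
    if tree[0] == "clf" or rng.random() < 0.3:
        return donor
    i = rng.randrange(1, len(tree))
    return tree[:i] + (graft(tree[i], donor, rng),) + tree[i + 1:]

def evolve(votes, truth, rng, pop_size=30, generations=20, max_nodes=40):
    n = len(votes[0])
    fitness = lambda tree: f_measure([evaluate(tree, v) for v in votes], truth)
    # Seed with every single-classifier tree so the result is never worse than the best one.
    pop = [("clf", i) for i in range(n)]
    pop += [random_tree(n, 3, rng) for _ in range(pop_size - n)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection; the elite survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = graft(a, rng.choice(list(subtrees(b))), rng)  # crossover
            if rng.random() < 0.2:
                child = graft(child, random_tree(n, 2, rng), rng)  # mutation
            if sum(1 for _ in subtrees(child)) <= max_nodes:  # crude bloat control
                children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

def demo(seed=0, t=3):
    rng = random.Random(seed)
    # Synthetic, imbalanced stand-in for labelled hosts: (features, label), label 1 = spam.
    data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(300)]
    data += [((rng.gauss(1.5, 1), rng.gauss(1.5, 1)), 1) for _ in range(40)]
    learners = [train_stump, train_centroid]  # c = 2 algorithm types
    base = [fit(rows) for rows in under_sample(data, t, rng) for fit in learners]  # t * c
    votes = [[clf(x) for clf in base] for x, _ in data]
    truth = [y for _, y in data]
    best = evolve(votes, truth, rng)
    base_f = [f_measure([v[i] for v in votes], truth) for i in range(len(base))]
    return best, f_measure([evaluate(best, v) for v in votes], truth), base_f

if __name__ == "__main__":
    best, ensemble_f, base_f = demo()
    print("evolved combiner:", best)
    print("ensemble F: %.3f, best single F: %.3f" % (ensemble_f, max(base_f)))
```

Because the initial population contains each base classifier as a one-node tree and the best individual always survives selection, the evolved combiner's F-measure on the fitness data cannot fall below that of the best single base classifier. Whether that gain carries over to unseen data is a separate question that this sketch does not address.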
Source Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2012, Issue 5, pp. 94-100 (7 pages)
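Funding National Natural Science Foundation of China (60970047, 61103151, 61173068); Natural Science Foundation of Shandong Province (Y2008G19); Shandong Province Domestic Visiting Scholar Program for Outstanding Young Teachers in Higher Education Institutions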
Keywords web spam; ensemble learning; genetic programming; classification on imbalanced datasets







