Abstract
Web spam interferes with users' ability to obtain information. To detect spam pages effectively, the distributions of web page content features and link features are analyzed. The analysis shows that the features of normal pages follow a regular distribution, whereas those of spam pages are scattered. Based on this difference, a distribution function is fitted to the feature distribution of normal pages, and the difference between the proportions of normal and spam pages and the fitted distribution function is calculated. A C4.5 decision tree then uses this difference as a threshold to detect spam pages. Experimental results show that the method effectively reduces the number of normal pages misclassified as spam and improves accuracy.
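The fitting and difference steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the distribution family, so a normal distribution is assumed here, and the bin edges, toy feature values, and all function names are hypothetical. The sketch stops at the per-bin differences, which would then be passed to a C4.5 decision tree (not shown).

```python
import math
from statistics import mean, pstdev


def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution (assumed family, not stated in the abstract)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bin_proportions(values, edges):
    """Fraction of values falling in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return [c / len(values) for c in counts]


def fitted_proportions(mu, sigma, edges):
    """Probability mass the fitted distribution assigns to each bin."""
    return [
        normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
        for lo, hi in zip(edges, edges[1:])
    ]


def deviations(values, mu, sigma, edges):
    """Absolute per-bin difference between observed and fitted proportions."""
    observed = bin_proportions(values, edges)
    expected = fitted_proportions(mu, sigma, edges)
    return [abs(o - e) for o, e in zip(observed, expected)]


# Hypothetical toy values of one content feature (e.g. a word count per page).
normal_pages = [48, 50, 52, 49, 51, 50, 47, 53, 50, 50]
spam_pages = [5, 95, 10, 90, 50, 8, 97, 12, 88, 3]
edges = [0, 20, 40, 60, 80, 100]

# Fit the assumed distribution on normal pages only.
mu, sigma = mean(normal_pages), pstdev(normal_pages)

normal_dev = deviations(normal_pages, mu, sigma, edges)
spam_dev = deviations(spam_pages, mu, sigma, edges)
```

In this toy setup the normal pages stay close to the fitted curve in every bin, while the scattered spam values deviate strongly in several bins. That gap is the kind of signal a threshold-based classifier could split on.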
Source
《计算机工程与设计》 (Computer Engineering and Design)
CSCD
Peking University Core Journal (北大核心)
2013, Issue 8, pp. 2651-2655 (5 pages)
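Funding
National Natural Science Foundation of China (61170145)
Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (20113704110001)
Shandong Provincial Natural Science Foundation and Science and Technology Development Program (ZR2010FM021, 2008B0026, 2010G0020115)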
Keywords
web spam
content features
link features
distribution function
decision tree