
Identifying Malicious Websites with PCA and Random Forest Methods
(Chinese title: 基于主成分分析和随机森林的恶意网站评估与识别; cited by: 8)
Abstract: [Objective] This study assesses and identifies malicious websites by making full use of multi-source website evaluation metrics. [Methods] After collecting a broad set of multi-source evaluation metrics, we applied principal component analysis (PCA) to assess malicious websites along multiple dimensions, and then built a malicious website identification model with the random forest classification algorithm on top of that assessment. [Results] The method effectively extracted five assessment dimensions: authority, references, traffic, ranking, and links. The identification model based on PCA and random forest achieved high accuracy and identification efficiency. [Limitations] Because of constraints on data collection, most samples were websites from outside China, so the extracted dimensions may differ from those of malicious websites in China. The study also did not account for the class imbalance between malicious and normal websites. [Conclusions] The proposed model extracts website assessment dimensions that are readily interpretable and identifies malicious websites with high accuracy and efficiency, providing a reference for subsequent research on assessing and identifying malicious websites.
Authors: Chen Yuan (陈远); Wang Chaoqun (王超群); Hu Zhongyi (胡忠义); Wu Jiang (吴江) (School of Information Management, Wuhan University, Wuhan 430072, China; The Center for Electronic Commerce Research and Development, Wuhan University, Wuhan 430072, China)
Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》; indexed in CSSCI, CSCD, and the Peking University Core Journals list), 2018, No. 4, pp. 71-80 (10 pages)
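Funding: This work is one of the research outcomes of the National Natural Science Foundation of China General Program project "Collaborative Research on Knowledge Flow and Cluster Interaction in Innovation 2.0 Supernetworks" (Grant No. 71373194) and the National Natural Science Foundation of China Young Scientists Fund project "Research on Interval-Valued Electric Load Forecasting Based on Ensemble Learning" (Grant No. 71601147).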
Keywords: Malicious Websites; Assessment and Identification; Principal Component Analysis; Random Forest
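The abstract describes a two-stage approach: PCA reduces many website metrics to a handful of assessment dimensions, and a random forest then classifies sites using those dimensions. The following is a minimal sketch of that general idea with scikit-learn, not the authors' implementation. The data are synthetic, the metric count (15), the labeling rule, and all variable names are assumptions made for illustration; only the choice of five components mirrors the five dimensions reported in the abstract.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_sites, n_latent, n_metrics = 1000, 5, 15

# ASSUMPTION: synthetic stand-in for multi-source website metrics.
# Five hidden factors each drive three observed, noisy metrics.
latent = rng.normal(size=(n_sites, n_latent))
loadings = np.kron(np.eye(n_latent), np.ones((1, 3)))  # shape (5, 15)
X = latent @ loadings + 0.3 * rng.normal(size=(n_sites, n_metrics))

# ASSUMPTION: hypothetical labeling rule (1 = malicious, 0 = normal).
y = (latent[:, 0] + latent[:, 1] < 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Standardize the metrics, project onto five principal components,
# then fit a random forest on the component scores.
model = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("forest", RandomForestClassifier(n_estimators=200, random_state=0)),
])
model.fit(X_train, y_train)

pca = model.named_steps["pca"]
importances = model.named_steps["forest"].feature_importances_
accuracy = model.score(X_test, y_test)

print("Variance explained per component:", pca.explained_variance_ratio_.round(3))
print("Component importances in the forest:", importances.round(3))
print("Held-out accuracy:", round(accuracy, 3))
```

On real data, the rows of `pca.components_` (the loadings) are what one would inspect to name each dimension. Since the abstract notes that class imbalance was not addressed, options such as `class_weight="balanced"` on `RandomForestClassifier` would be a natural extension rather than part of the described method.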
