Abstract
[Objective] This study investigates how to assess and identify malicious websites by making full use of multi-source website evaluation metrics. [Methods] Building on a broad collection of multi-source evaluation metrics, we applied principal component analysis (PCA) to assess malicious websites along multiple dimensions, and then used the random forest classification algorithm to build a malicious website identification model on top of that assessment. [Results] The method effectively extracted five assessment dimensions: authority, references, traffic, ranking, and links. The identification model based on PCA and random forest achieved high accuracy and identification efficiency. [Limitations] Owing to constraints on data acquisition, most samples in this study were non-Chinese websites, so the extracted dimensions may differ from those of malicious websites in China. In addition, we did not account for the imbalance between the numbers of malicious and normal websites. [Conclusions] The proposed model based on PCA and random forest extracts website evaluation dimensions with good interpretability while achieving high identification accuracy and efficiency, and it can serve as a reference for subsequent research on assessing and identifying malicious websites.
Authors
陈远
王超群
胡忠义
吴江
Chen Yuan; Wang Chaoqun; Hu Zhongyi; Wu Jiang (School of Information Management, Wuhan University, Wuhan 430072, China; The Center for Electronic Commerce Research and Development, Wuhan University, Wuhan 430072, China)
Source
《数据分析与知识发现》
CSSCI
CSCD
Peking University Core Journal (北大核心)
2018, Issue 4, pp. 71-80 (10 pages)
Data Analysis and Knowledge Discovery
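Funding
This work is one of the research outcomes of the National Natural Science Foundation of China General Program project "Collaborative Research on Knowledge Flow and Cluster Interaction in Innovation 2.0 Supernetworks" (Grant No. 71373194) and the National Natural Science Foundation of China Young Scientists Fund project "Research on Interval-Valued Electric Load Forecasting Based on Ensemble Learning" (Grant No. 71601147).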
Keywords
Malicious Websites
Assessment and Identification
Principal Component Analysis
Random Forest