Web spam detection based on semantic features from current page
Abstract: Feature extraction for web spam detection is often difficult and computationally expensive. To address this, the paper proposes a method that extracts semantic features solely from the HTML source of the current page. First, the domain name is segmented into words by a memoized search algorithm that combines depth-first search with dynamic programming. Second, latent Dirichlet allocation (LDA) is used to extract topic words from the web page. Third, three single-page semantic similarity features are computed from Word2Vec word vectors and the Word Mover's Distance. These semantic similarity features are then fused with single-page statistical features, and classification algorithms such as random forest are used to build models for web spam detection. Experimental results show that classification using the combined single-page semantic and statistical features reaches an AUC of 88.0%, about 4% higher than the control method.
Authors: Chen Musheng; Gao Fei; Wu Junhua (School of Software Engineering, Jiangxi University of Science and Technology, Nanchang 330013, China; Nanchang Key Laboratory of Virtual Digital Engineering and Cultural Communication, Nanchang 330013, China)
Source: Application of Electronic Technique, 2023, No. 6, pp. 24-29 (6 pages)
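Funding: Scientific Research Project of the Jiangxi Provincial Department of Education (GJJ180450); Scientific Research Project of the Jiangxi Provincial Department of Education (GJJ200839); Doctoral Start-up Fund of Jiangxi University of Science and Technology (205200100402).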
Keywords: web spam detection; feature extraction; memoized search; latent Dirichlet allocation; word vector (Word2Vec); Word Mover's Distance; random forest
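
The domain-name segmentation step named in the abstract (memoized search combining depth-first search with dynamic programming) can be illustrated with a minimal Python sketch. The abstract does not say how competing segmentations are ranked or which dictionary is used, so this sketch assumes that the split with the fewest words wins and that a word list is supplied by the caller. The names `segment_domain`, `vocabulary`, and `max_word_len` are hypothetical and are not taken from the paper.

```python
from functools import lru_cache


def segment_domain(label, vocabulary, max_word_len=20):
    """Split a domain label into vocabulary words with memoized DFS.

    Assumption (not from the paper): among all valid splits, the one
    with the fewest words is preferred. Returns a list of words, or
    None when the label cannot be covered by vocabulary words.
    """
    label = label.lower()
    vocab = frozenset(word.lower() for word in vocabulary)

    @lru_cache(maxsize=None)
    def best_from(start):
        # Best split of label[start:], cached so that each suffix is
        # solved only once (the dynamic-programming part).
        if start == len(label):
            return ()
        best = None
        last_end = min(len(label), start + max_word_len)
        for end in range(start + 1, last_end + 1):
            word = label[start:end]
            if word not in vocab:
                continue
            rest = best_from(end)  # depth-first descent into the suffix
            if rest is None:
                continue
            candidate = (word,) + rest
            if best is None or len(candidate) < len(best):
                best = candidate
        return best

    result = best_from(0)
    return list(result) if result is not None else None


# Toy word list chosen only for illustration.
toy_vocabulary = {"cheap", "che", "ap", "loan", "loans", "s",
                  "on", "line", "online"}
print(segment_domain("cheaploansonline", toy_vocabulary))
# → ['cheap', 'loans', 'online']
```

Caching the result for each suffix keeps the search polynomial in the label length, whereas a plain depth-first search could revisit the same suffix many times. A frequency-weighted score could replace the fewest-words rule if word statistics are available.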