Improving Knowledge Based Spam Detection Methods: The Effect of Malicious Related Features in Imbalance Data Distribution (cited by: 4)

Abstract: Spam is no longer just unsolicited commercial email that wastes our time; it also consumes network traffic and mail server storage. Furthermore, spam has become a major component of several attack vectors, including phishing, cross-site scripting, cross-site request forgery, and malware infection. Statistics show that the amount of spam containing malicious content has increased relative to spam advertising legitimate products and services. In this paper, we investigate spam detection with the aim of developing an efficient method for identifying spam email based on analysis of message content. We identify a set of features that includes a considerable number of malicious-related features. Our goal is to study how these features help classical classifiers identify spam emails. To make the problem more challenging, we developed spam classification models on imbalanced data in which spam emails form the rare class, accounting for only 16.5% of all emails. Different metrics were used to evaluate the developed models. The results show a noticeable improvement in spam classification models trained on a dataset that includes malicious-related features.
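The abstract does not list which evaluation metrics were used, so the following is only a minimal sketch of why a single metric such as accuracy can mislead when spam is the rare class at 16.5%. The metric choice (accuracy, precision, recall, F1), the 1,000-email sample size, and the counts for the second classifier are all illustrative assumptions, not figures from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Define precision/recall/F1 as 0.0 when their denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical sample: 1,000 emails, 165 spam (16.5%) and 835 legitimate.
y_true = [1] * 165 + [0] * 835

# Baseline that labels every email as legitimate.
always_ham = [0] * 1000
baseline = binary_metrics(y_true, always_ham)

# Hypothetical classifier: catches 120 of 165 spam, wrongly flags 40 legitimate emails.
some_model = [1] * 120 + [0] * 45 + [1] * 40 + [0] * 795
model = binary_metrics(y_true, some_model)

print("baseline:", baseline)
print("model:   ", model)
```

Under these assumed counts, the trivial baseline reaches 83.5% accuracy while detecting no spam at all, which is why class-specific metrics are informative for imbalanced data of this kind.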

Authors: Jafar Alqatawna, Hossam Faris, Khalid Jaradat, Malek Al-Zewairi, Omar Adwan

Affiliation: King Abdullah II School for Information Technology

Source: International Journal of Communications, Network and System Sciences, 2015, Issue 5, pp. 118-129 (12 pages)

Keywords: Spam E-Mail; Malicious Spam; Spam Detection; Spam Features; Security Mechanism; Data Mining

Classification code: R73 [Medicine and Health: Oncology]

