Abstract
Traditional email classification models rarely describe or analyze the topics of an email. To address this, a cost-sensitive multi-topic learning algorithm for email classification is proposed for spam filtering. First, multiple topics are extracted from each email using latent Dirichlet allocation (LDA) to describe its semantics. Second, a cost-sensitive support vector machine (CS-SVM) performs cost-sensitive learning, applying different penalties to different classes of email. Finally, CS-SVM is combined with a multiple-instance support vector machine (MI-SVM) to perform cost-sensitive multi-topic learning and classify the emails. The experiments used four processed Ling-Spam datasets. The results show that, compared with traditional email classification algorithms, the proposed algorithm achieves higher accuracy, specificity, and sensitivity.
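The cost-sensitive, multiple-instance decision step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each email is a bag of per-topic feature vectors (hand-written toy vectors stand in for LDA output), uses a linear model trained by subgradient descent on a class-weighted hinge loss, and assumes that misclassifying legitimate mail costs more than missing spam. All function names, cost values, and data are hypothetical.

```python
# Illustrative sketch only: a linear, cost-sensitive, multiple-instance SVM
# trained by subgradient descent. Names, costs, and data are assumptions.

def score(w, b, x):
    """Linear decision value w.x + b for one instance (one topic vector)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def witness(w, b, bag):
    """MI-SVM idea: a bag is represented by its highest-scoring instance."""
    return max(bag, key=lambda x: score(w, b, x))

def train(bags, labels, cost_pos=1.0, cost_neg=2.0,
          lam=0.01, lr=0.05, epochs=500):
    """Fit (w, b) on bags labelled +1 (spam) or -1 (legitimate).

    cost_pos / cost_neg scale the hinge penalty per class; making
    cost_neg larger is an assumed choice, not taken from the paper.
    """
    w, b = [0.0] * len(bags[0][0]), 0.0
    for _ in range(epochs):
        for bag, y in zip(bags, labels):
            x = witness(w, b, bag)
            violated = y * score(w, b, x) < 1.0
            # L2 regularization: shrink the weights a little every step.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if violated:
                cost = cost_pos if y > 0 else cost_neg
                w = [wi + lr * cost * y * xi for wi, xi in zip(w, x)]
                b += lr * cost * y
    return w, b

def predict(w, b, bag):
    """Label a bag as spam (+1) if its witness instance scores above zero."""
    return 1 if score(w, b, witness(w, b, bag)) > 0 else -1

# Toy bags: each inner list is one hypothetical topic feature vector.
# Spam bags contain at least one spam-like instance (large first feature).
bags = [
    [[1.0, 0.0], [0.0, 1.0]],   # spam
    [[0.9, 0.1], [0.1, 0.9]],   # spam
    [[0.0, 1.0], [0.1, 0.9]],   # legitimate
    [[0.2, 0.8], [0.0, 1.0]],   # legitimate
]
labels = [1, 1, -1, -1]

w, b = train(bags, labels)
print([predict(w, b, bag) for bag in bags])
```

In this sketch a spam email needs only one spam-like topic vector to be flagged, while a legitimate email must have every topic vector score below zero, which mirrors the usual multiple-instance assumption.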
Authors
Zhang Shaocheng (张绍成)
Liu Wei (刘威)
Cheng Ziao (程子傲)
Wang Danhua (王丹华)
Affiliations: Informatization Center, Liaoning University, Shenyang 110036, China; Information Network Center, Shenyang Jianzhu University, Shenyang 110168, China
Source
Journal of Huazhong University of Science and Technology (Natural Science Edition) (《华中科技大学学报(自然科学版)》)
Indexed in: EI, CAS, CSCD, PKU Core (北大核心)
2016, Issue S1, pp. 176-180 (5 pages)
Funding
National Natural Science Foundation of China (61502092)
Keywords
latent Dirichlet allocation
support vector machine
spam filtering
text classification
multiple-instance learning