Abstract
Spam identification is an important research topic in computer forensics. Most spam identification methods fail to account effectively for the influence of user interests on identification results. This paper proposes a novel spam identification method based on incremental learning and active learning. First, to obtain the most effective features, both term and non-term information are considered in the feature selection stage. Second, a projection-based uncertain sample selection method is proposed to reduce the time needed to select samples for labeling. Finally, for the sample labeling process, a new labeling method is proposed that automatically recommends each sample's class label and user interest degree. Comparative experiments show that the proposed method achieves high spam identification precision, selects samples for labeling quickly, and imposes a low labeling burden on users, indicating high practical value.
Source
《计算机科学》
CSCD
Peking University Core Journals (北大核心)
2015, Issue B10, pp. 23-27 (5 pages)
Computer Science
Funding
This work was supported by the Open Fund of the Key Laboratory of Information Assurance Technology (KJ-14-008).
Keywords
Spam identification
Computer forensics
Incremental learning
Active learning
Sample labeling
User interest degree