Abstract
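Filtering methods that combine machine learning, text categorization, and information filtering have become an active research topic. In practical email filtering, the training set often contains a large number of emails without class labels, and applying traditional classification methods to such data is time-consuming and gives poor filtering performance. This paper proposes using the active Bayesian classification method RANB to preprocess the training samples and label their multiple categories. Experiments show that this method effectively improves the quality of the training samples and the performance of the filter, and that it is superior on each evaluation metric.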
Current estimates indicate that nearly sixty percent of email traffic is spam, and there is little reason to expect this to change. Machine learning, text categorization, and information filtering can be used effectively to address the problem. When there are a large number of unlabeled emails, existing state-of-the-art classification methods usually require the classes to be labeled first, which incurs heavy time overhead and reduces classification accuracy. Therefore, an active Bayesian classification technique, RANB, is proposed in this paper to label the classes of the unlabeled training emails as a preprocessing step. The experimental study shows that, while preserving the capability of the filter relative to classical methods, the method effectively improves the quality of the training samples and performs better on the evaluation criteria.
Source
《合肥工业大学学报(自然科学版)》
CAS
CSCD
PKU Core (Peking University core journal list)
2008, No. 9, pp. 1443-1446 (4 pages)
Journal of Hefei University of Technology: Natural Science
Funding
Natural Science Foundation of Anhui Province (Grant No. 050420207)
Keywords
spam
machine learning
text categorization
information filter
active learning
naive Bayes classification