Abstract
When Mutual Information (MI) and the Naive Bayes (NB) algorithm are applied to spam filtering, they suffer from feature redundancy and from an independence assumption that does not hold. To address these problems, this paper proposes an Improved Mutual Information-Weighted Naive Bayes (IMI-WNB) algorithm. To address the low efficiency of MI, a word frequency factor and an inter-class difference factor are introduced, yielding an improved MI feature selection algorithm that achieves more efficient feature dimensionality reduction. To address the independence assumption of the NB classifier, the Improved Mutual Information (IMI) value is used to weight features during NB classification, which partly eliminates the adverse effect of the conditional independence assumption on mail classification. Experimental results show that, compared with the traditional NB algorithm, the proposed algorithm improves the accuracy, recall, and stability of spam filtering.
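The general idea of MI-based feature selection followed by MI-weighted NB classification can be illustrated with a minimal sketch. It is not the authors' IMI-WNB implementation: the abstract does not give the formulas for the word frequency factor or the inter-class difference factor, so the sketch assumes plain pointwise MI (maximum over classes) as a stand-in for the IMI value, and applies it as a multiplier on each feature's log-likelihood. The function names `mi_scores`, `train_weighted_nb`, and `classify` are hypothetical.

```python
import math
from collections import Counter

def mi_scores(docs, labels):
    """Pointwise MI between term presence and class, max over classes.
    ASSUMPTION: stands in for the paper's IMI value, whose extra factors
    are not specified in the abstract."""
    n = len(docs)
    class_docs = Counter(labels)   # number of documents per class
    term_docs = Counter()          # number of documents containing each term
    joint = Counter()              # documents of class c containing term t
    for tokens, c in zip(docs, labels):
        for t in set(tokens):
            term_docs[t] += 1
            joint[(t, c)] += 1
    scores = {}
    for t, df in term_docs.items():
        best = 0.0
        for c, nc in class_docs.items():
            a = joint[(t, c)]
            if a:
                # log( P(t, c) / (P(t) * P(c)) ) estimated from document counts
                best = max(best, math.log(a * n / (df * nc)))
        scores[t] = best
    return scores

def train_weighted_nb(docs, labels, k):
    """Keep the k highest-scoring terms and fit a multinomial NB on them."""
    scores = mi_scores(docs, labels)
    selected = sorted(scores, key=scores.get, reverse=True)[:k]
    weights = {t: scores[t] for t in selected}   # feature weight = MI score
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for tokens, c in zip(docs, labels):
        counts[c].update(t for t in tokens if t in weights)
    loglik = {}
    for c in classes:
        total = sum(counts[c].values()) + len(weights)   # Laplace smoothing
        loglik[c] = {t: math.log((counts[c][t] + 1) / total) for t in weights}
    return prior, loglik, weights

def classify(model, tokens):
    """Pick the class maximizing log P(c) + sum_t w_t * log P(t | c)."""
    prior, loglik, weights = model
    def score(c):
        return prior[c] + sum(weights[t] * loglik[c][t]
                              for t in tokens if t in weights)
    return max(prior, key=score)
```

With all weights set to 1 this reduces to ordinary multinomial NB; the weighting lets terms that carry more class information contribute more to the decision, which is the intuition behind feature-weighted NB.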
Authors
LIU Jie (刘洁)
WANG Zheng (王铮)
WANG Hui (王辉)
School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan 454000, China
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
PKU Core Journal (北大核心)
2020, Issue 12, pp. 299-304, 312 (7 pages in total)
Funding
National Natural Science Foundation of China (61300216).
Keywords
Mutual Information (MI)
spam filtering
Weighted Naive Bayes (WNB) algorithm
feature selection
word frequency