Abstract
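This paper studies, analyzes, and experimentally validates spam-filtering techniques based on the naive Bayes algorithm. It introduces the vector space model (VSM) and feature-vector extraction methods, and derives and examines the naive Bayes classification algorithm, which assumes that features are mutually independent. Using K-fold cross-validation with a collection of e-mails as the corpus, the naive Bayes classifier is applied: the class prior probabilities and the class-conditional probabilities of the feature terms are computed from the training set, and on that basis each e-mail in the test set is assigned to a class. Experimental results are reported in terms of precision and recall.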
Spam has seriously disrupted e-mail communication. This paper derives and analyzes the naive Bayesian categorization algorithm, then applies and validates it in spam-filtering experiments. First, the paper introduces text categorization techniques, including the commonly used vector space model for representing text and feature extraction methods such as information gain and document frequency. The behavior of the information gain method in the experiments is also explained. Second, it derives and analyzes the naive Bayesian classifier under the assumption that features are mutually independent. Then, using previously collected mails as the corpus and k-fold cross-validation, it applies the naive Bayesian classifier in experiments. Based on the prior probabilities of the categories and the class-conditional probabilities of the terms, both estimated from the training corpus, the paper classifies each mail in the test corpus. Finally, the experimental results are reported using two indicators, precision and recall.
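The pipeline the abstract describes (estimate class priors and class-conditional term probabilities from a training set, classify held-out mails, score with precision and recall under k-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy corpus, the Laplace (add-one) smoothing, and the interleaved fold split are all assumptions made for illustration, and feature selection by information gain or document frequency is omitted.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (tokens, label) pairs. Not data from the paper.
TOY_CORPUS = [
    (["free", "money", "now"], "spam"),
    (["win", "money", "free"], "spam"),
    (["meeting", "schedule", "today"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]

def train(docs):
    """Estimate class priors and per-class term counts from labeled docs."""
    class_counts = Counter(label for _, label in docs)
    token_counts = {label: Counter() for label in class_counts}
    for tokens, label in docs:
        token_counts[label].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    return priors, token_counts, vocab

def classify(tokens, priors, token_counts, vocab):
    """Return the class maximizing log P(c) + sum of log P(t | c)."""
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        total = sum(token_counts[label].values())
        score = math.log(prior)
        for t in tokens:
            if t not in vocab:
                continue  # ignore terms never seen in training
            # Laplace smoothing: an assumption, not stated in the abstract.
            p = (token_counts[label][t] + 1) / (total + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def precision_recall(true_labels, predicted, positive="spam"):
    """Precision and recall for the positive (spam) class."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def cross_validate(docs, k):
    """k-fold cross-validation with a simple interleaved split (an assumption)."""
    folds = [docs[i::k] for i in range(k)]
    results = []
    for i, test_fold in enumerate(folds):
        train_docs = [d for j, fold in enumerate(folds) if j != i for d in fold]
        model = train(train_docs)
        predicted = [classify(tokens, *model) for tokens, _ in test_fold]
        true_labels = [label for _, label in test_fold]
        results.append(precision_recall(true_labels, predicted))
    return results
```

Calling `cross_validate(TOY_CORPUS, 2)` returns one (precision, recall) pair per fold. Working in log space avoids numeric underflow when many term probabilities are multiplied, which is the usual reason to sum logarithms rather than multiply probabilities directly.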
Source
《南京师范大学学报(工程技术版)》
CAS
2005, No. 4, pp. 61-64 (4 pages)
Journal of Nanjing Normal University(Engineering and Technology Edition)
Funding
Supported by the Natural Science Foundation of Jiangsu Province (01KJD520005)