Abstract
Naive Bayes classifies inaccurately when the classes in the training set are unevenly distributed and the data are sparse. To address this, a Naive Bayes text classification algorithm based on data smoothing and a weighted complement set is proposed. The algorithm uses data smoothing to compute compensation probabilities for features missing from the Bayesian model, which overcomes the data sparsity problem. It also represents each class by the features of that class's complement set, which counters the classifier's tendency to favor large classes and neglect small ones when class sizes are uneven. Experimental results show that, when the sample set is imbalanced, the proposed algorithm classifies better than the traditional Naive Bayes algorithm.
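The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes add-alpha (Laplace) smoothing as a stand-in for the paper's data smoothing step and an L1 normalization of the log-weights as a stand-in for its weighting scheme. The function names `train_complement_nb` and `predict` are hypothetical.

```python
import math
from collections import Counter


def train_complement_nb(docs, labels, alpha=1.0):
    """Fit complement-class weights from tokenized documents.

    docs   : list of token lists
    labels : list of class labels, one per document
    alpha  : add-alpha smoothing constant (assumption: Laplace smoothing,
             not necessarily the smoothing used in the paper)
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    classes = sorted(set(labels))

    total = Counter()                          # token counts over all documents
    per_class = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        total.update(doc)
        per_class[label].update(doc)

    weights = {}
    for c in classes:
        # Counts from every document that is NOT in class c (the complement).
        comp = {t: total[t] - per_class[c][t] for t in vocab}
        denom = sum(comp.values()) + alpha * len(vocab)
        # Smoothed log-probability of each token under the complement of c;
        # tokens never seen in the complement still get a nonzero probability.
        logp = {t: math.log((comp[t] + alpha) / denom) for t in vocab}
        # Assumption: L1-normalize the weights so no class dominates by scale.
        norm = sum(abs(v) for v in logp.values()) or 1.0
        weights[c] = {t: v / norm for t, v in logp.items()}
    return weights


def predict(weights, doc):
    """Return the class whose complement explains the document worst."""
    counts = Counter(doc)

    def complement_score(c):
        # Out-of-vocabulary tokens contribute nothing.
        return sum(n * weights[c].get(t, 0.0) for t, n in counts.items())

    # The lowest complement score means the document least resembles "not c".
    return min(weights, key=complement_score)
```

Because each class is scored against the pooled counts of all other classes, a small class is estimated from a large complement, which is what reduces the bias toward large classes.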
Source
《黑龙江大学自然科学学报》
CAS
Peking University Core Journals (北大核心)
2015, No. 5, pp. 681-686 (6 pages)
Journal of Natural Science of Heilongjiang University
Funding
Natural Science Foundation of Heilongjiang Province (ZD201403)
Special Fund for Forestry Scientific Research in the Public Welfare (201504307)
Keywords
Naive Bayes
text categorization
data smoothing
weighted complementary set