Abstract
To address the poor classification accuracy that complement naive Bayes classifiers show on skewed data sets, which results from overfitting, this paper proposes a modified complement naive Bayes algorithm that uses a better estimate of the class prior probability. Experiments analyzing the effect of skewed data distributions show that the modified algorithm maintains high classification accuracy on skewed data, is robust to the degree of skew, and achieves higher precision than other naive Bayes algorithms. The paper also discusses how Web page classification differs from plain text classification, builds a title-weighted document vector space model, and analyzes the effect of the title weight on classifier precision. Experiments show that the title-weighted vector space model improves classification precision by 5% on average compared with ordinary text classification.
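The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not give the modified prior estimate or the exact title-weighting formula, so the code below uses the standard complement naive Bayes estimate with Laplace smoothing and assumes that each title token simply counts `title_weight` times. The names `weighted_counts`, `train_cnb`, `predict`, `title_weight`, and `alpha` are hypothetical.

```python
import math
from collections import Counter


def weighted_counts(title, body, title_weight=3.0):
    """Term counts for one document.

    Assumption: a title token counts `title_weight` times, a body token once.
    """
    counts = Counter()
    for tok in body.lower().split():
        counts[tok] += 1.0
    for tok in title.lower().split():
        counts[tok] += title_weight
    return counts


def train_cnb(docs, labels, alpha=1.0):
    """Standard complement naive Bayes (not the paper's modified variant).

    For each class c, term weights are estimated from the documents of all
    classes other than c, which reduces the bias toward large classes.
    """
    vocab = sorted({t for d in docs for t in d})
    classes = sorted(set(labels))
    total = Counter()
    per_class = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        total.update(d)
        per_class[y].update(d)

    weights = {}
    for c in classes:
        # Counts of each term in the complement of class c.
        comp = {t: total[t] - per_class[c][t] for t in vocab}
        denom = sum(comp.values()) + alpha * len(vocab)
        weights[c] = {t: math.log((comp[t] + alpha) / denom) for t in vocab}
    return weights


def predict(weights, doc):
    """Return the class whose complement fits the document worst."""
    scores = {
        c: sum(f * w.get(t, 0.0) for t, f in doc.items())
        for c, w in weights.items()
    }
    return min(scores, key=scores.get)


# A deliberately skewed toy training set: three "sports" pages, one "finance" page.
train = [
    ("football final", "team wins match", "sports"),
    ("league results", "team scores goal", "sports"),
    ("cup match", "goal wins game", "sports"),
    ("stock market", "shares fall bank", "finance"),
]
docs = [weighted_counts(title, body) for title, body, _ in train]
labels = [label for _, _, label in train]
model = train_cnb(docs, labels)
```

With this toy data, `predict(model, weighted_counts("bank shares", "market report"))` selects the minority class `finance` even though it has only one training page. Setting `title_weight=0.0` drops the title terms entirely, which is one way to probe how much the title contributes.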
Funding
National High Technology Research and Development Program of China (863 Program) (No. 2008AA01Z119)