Abstract
To address the shortcomings of the attribute-independence assumption in Naive Bayes classification, this paper discusses correlation and multivariate correlation and defines the degree of correlation between terms. Building on an analysis of inter-term correlation in the TAN classifier, the authors propose a formula for estimating the correlation degree of document feature words, together with an algorithm that applies it in an improved Naive Bayes classification model. Experiments on the Reuters-21578 text collection show that the improved algorithm is simple to implement and effectively improves Bayesian classification performance.
Source
《计算机工程与应用》
CSCD
Peking University Core Journals (北大核心)
2009, No. 16, pp. 159-161 (3 pages)
Computer Engineering and Applications
Keywords
text classification
Naive Bayes
event correlation
correlation degree
Tree Augmented Naive Bayes (TAN) classifier