Abstract
Given the output class, the naive Bayes algorithm assumes that attributes are mutually independent. In practice this assumption rarely holds, so classification performance degrades when the number of attributes is large or the attributes are strongly correlated. To address this problem, this paper improves the naive Bayes algorithm with an optimized fuzzy C-means clustering algorithm and a weight calculation method. First, an adaptive function for the number of clusters is constructed from the JS (Jensen-Shannon) divergence to optimize the fuzzy clustering algorithm, and the optimized algorithm is used to group the texts. Then, a TF-IDF algorithm optimized with a word-frequency factor computes the feature weights of each sample after clustering, and these weights are combined with Bayes' formula to perform classification. Finally, to demonstrate its effectiveness, the improved algorithm is compared experimentally with the original naive Bayes algorithm and other improved variants. The results show that the improved algorithm reduces the naive Bayes model's dependence on feature independence, improves classification accuracy, and offers some advantages in classification performance and efficiency.
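The two building blocks named in the abstract, the JS divergence and a feature-weighted Bayes decision rule, can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract gives neither the adaptive cluster-count function nor the word-frequency-factor TF-IDF formula, so the sketch assumes a standard smoothed IDF as the feature weight and Laplace smoothing for the likelihoods. All function names (`js_divergence`, `idf_weights`, `train`, `predict`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) of two discrete distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def idf_weights(docs):
    """Assumed weighting: plain smoothed IDF, not the paper's optimized TF-IDF."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}


def train(docs, labels):
    """Collect class priors, per-class term counts, and the vocabulary."""
    prior = Counter(labels)
    counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        counts[y].update(d)
    vocab = {t for d in docs for t in d}
    return prior, counts, vocab


def predict(doc, model, weights):
    """Pick the class maximizing log P(c) + sum_t w_t * log P(t | c)."""
    prior, counts, vocab = model
    total = sum(prior.values())
    best, best_score = None, -math.inf
    for y in prior:
        denom = sum(counts[y].values()) + len(vocab)  # Laplace smoothing
        score = math.log(prior[y] / total)
        for t in doc:
            if t in vocab:  # ignore terms never seen in training
                score += weights[t] * math.log((counts[y][t] + 1) / denom)
        if score > best_score:
            best, best_score = y, score
    return best


docs = [
    ["goal", "match", "team"],
    ["team", "win", "match"],
    ["stock", "market", "price"],
    ["market", "trade", "price"],
]
labels = ["sport", "sport", "finance", "finance"]
model = train(docs, labels)
weights = idf_weights(docs)
```

With this toy corpus, `predict(["match", "goal"], model, weights)` selects the `sport` class. Raising each term's log-likelihood contribution by a weight is one common way to relax the equal-importance treatment of features in naive Bayes; the paper's own weighting may differ.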
Authors
XIN Ziming (辛梓铭)
WANG Fang (王芳)
School of Science, Yanshan University, Qinhuangdao, Hebei 066004, China
Source
Journal of Yanshan University (《燕山大学学报》)
CAS
Peking University Core Journals (北大核心)
2023, Issue 1, pp. 82-88 (7 pages)
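Funding
Natural Science Foundation of Hebei Province (F2020203105)
Science and Technology Research Project of Higher Education Institutions of Hebei Province (ZD2022012)
National Natural Science Foundation of China (62073234)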
Keywords
naive Bayes
text classification
fuzzy clustering
feature weight
independence assumption