Abstract
Text classification algorithms, typified by naive Bayes, commonly assign the same weight to every feature and rely on a single indicator. To address this, we propose TF-IDF-DL naive Bayes, an improved naive Bayes algorithm based on TF-IDF. Building on TF-IDF, the algorithm introduces a decentralized word frequency factor and a feature word position factor to make the feature weights more accurate. To verify its effectiveness, we ran experiments on the Sogou news dataset from Sogou Labs. The results show that introducing TF-IDF-DL into naive Bayes classification yields good accuracy, recall, and F1 in text classification. Compared with TF-IDF-dist Bayes, a comparable scheme from domestic (Chinese) research, classification accuracy improves by 8.6%, recall by 11.7%, and F1 by 7.4%. The algorithm therefore improves classification performance and, to some extent, also achieves good results on categories that are hard to distinguish.
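The abstract names TF-IDF-weighted naive Bayes as the starting point but gives no formulas for the decentralized word frequency factor or the feature word position factor, so neither is reproduced here. As a rough illustration of the baseline only, the following is a minimal sketch of multinomial naive Bayes trained on TF-IDF-weighted term counts. The function names (`idf_table`, `train`, `predict`), the smoothed IDF formula, the Laplace smoothing, and the toy corpus are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def idf_table(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed smoothing: log((1 + N) / (1 + df)) + 1
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}


def train(docs, labels):
    """Fit naive Bayes on TF-IDF-weighted counts instead of raw counts."""
    idf = idf_table(docs)
    weight = defaultdict(lambda: defaultdict(float))  # class -> term -> summed TF-IDF
    for doc, label in zip(docs, labels):
        for term, tf in Counter(doc).items():
            weight[label][term] += tf * idf[term]
    prior = {c: math.log(n / len(labels)) for c, n in Counter(labels).items()}
    return {"idf": idf, "weight": weight, "prior": prior, "vocab": len(idf)}


def predict(model, doc):
    """Return the class with the highest log posterior for a tokenized document."""
    scores = {}
    for c, log_prior in model["prior"].items():
        total = sum(model["weight"][c].values())
        score = log_prior
        for term, tf in Counter(doc).items():
            if term not in model["idf"]:
                continue  # skip terms never seen in training
            # Laplace-smoothed conditional probability from the weighted counts
            p = (model["weight"][c].get(term, 0.0) + 1.0) / (total + model["vocab"])
            score += tf * math.log(p)
        scores[c] = score
    return max(scores, key=scores.get)


# Toy, invented corpus (not the Sogou news dataset), already tokenized.
toy_docs = [
    ["ball", "goal", "team"],
    ["match", "team", "goal"],
    ["stock", "market", "bank"],
    ["bank", "loan", "market"],
]
toy_labels = ["sports", "sports", "finance", "finance"]
model = train(toy_docs, toy_labels)
```

Calling `predict(model, ["goal", "team"])` then scores each class by its log prior plus the weighted log likelihoods. Any additional per-term factor, such as the position factor the abstract mentions, would presumably multiply into `tf * idf[term]` inside `train`, but that placement is a guess rather than something the abstract states.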
Authors
XU Tian-hua (许甜华); WU Ming-li (吴明礼)
School of Informatics, North China University of Technology, Beijing 100144, China
Source
Computer Technology and Development (《计算机技术与发展》)
2020, No. 2, pp. 75-79 (5 pages)
Funding
National Natural Science Foundation of China (61672040)