Abstract
This study uses the LDA model for text dimensionality reduction and feature extraction and trains traditional classification algorithms within an ensemble learning framework. The aim is to examine whether ensembling improves on the accuracy of a single classification algorithm, and whether it lets LDA-based features contribute more to text classification accuracy. With Web of Science as the data source and its subject categories as the classification standard, an experimental text set covering 6 topics is built. Using Weka as the experimental tool and the average F value as the evaluation index, the study compares four traditional classification algorithms (naive Bayes, logistic regression, support vector machine (SVM), and K-nearest neighbors (KNN)) with three ensemble learning algorithms (AdaBoost, Bagging, and Random Subspace). Overall, classification accuracy after "homogeneous ensembling" is higher than that of a single classifier. The combination of LDA for dimensionality reduction and feature extraction, naive Bayes as the base classifier, and Bagging for ensemble training gives the best classification performance, achieving what the authors call the "global optimum".
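The best-performing configuration described above (a naive Bayes base classifier wrapped in Bagging, scored by an averaged F value) can be illustrated with a minimal sketch. This is not the authors' Weka setup. It assumes each document has already been reduced to a dense vector of topic proportions (for example by LDA), swaps in a small Gaussian naive Bayes written from scratch, and reads "average F value" as the macro-averaged F1. All names (`GaussianNB`, `bagging_fit`, `bagging_predict`, `macro_f1`) are hypothetical.

```python
import math
import random
from collections import Counter


class GaussianNB:
    """Tiny Gaussian naive Bayes for dense numeric features (assumed stand-in)."""

    def fit(self, X, y):
        self.stats = {}
        n = len(y)
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            cols = list(zip(*rows))
            means = [sum(c) / len(c) for c in cols]
            # A small variance floor avoids division by zero.
            varis = [sum((v - m) ** 2 for v in c) / len(c) + 1e-3
                     for c, m in zip(cols, means)]
            self.stats[label] = (math.log(len(rows) / n), means, varis)
        return self

    def predict_one(self, x):
        def log_posterior(label):
            prior, means, varis = self.stats[label]
            return prior + sum(
                -0.5 * math.log(2 * math.pi * s) - (v - m) ** 2 / (2 * s)
                for v, m, s in zip(x, means, varis))
        return max(sorted(self.stats), key=log_posterior)


def bagging_fit(X, y, n_estimators=10, seed=0):
    """Train one base classifier per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(y)) for _ in y]
        models.append(GaussianNB().fit([X[i] for i in idx],
                                       [y[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Majority vote over the base classifiers; ties go to the smaller label."""
    votes = Counter(m.predict_one(x) for m in models)
    return min(votes, key=lambda k: (-votes[k], k))


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every base learner is the same algorithm trained on a different resample, this is an instance of the homogeneous ensembling the abstract refers to. Calling `bagging_fit(X, y)` on topic-proportion vectors `X` with subject labels `y`, then `bagging_predict(models, x)` for each held-out document, yields predictions that `macro_f1` can score.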
Authors
Wang Wanqi (王万起)
Tian Zhongyu (田中雨)
Dong Lanjun (董兰军)
Wang Wanqi; Tian Zhongyu (Liaoning Technical University, Fuxin, Liaoning 123000, China)
Source
Library Work in Colleges and Universities (《高校图书馆工作》)
2022, No. 2, pp. 41-46 (6 pages)
Keywords
Text classification
Ensemble learning
Algorithm comparison
F value
Topic model