Abstract
Text classification is an important part of text data mining. The main algorithms currently used for text classification include KNN, Bayesian classifiers, and neural networks. The KNN algorithm is applied in text classification because its principle is simple and its classification performance is good, but its running efficiency is limited when the amount of data is large. This paper proposes a KNN algorithm based on central sampling and verifies it on the 20newsgroup data set. The proposed algorithm improves running efficiency without reducing accuracy and achieves good results.
Authors
肖绍武
王子牛
高建瓴
XIAO Shaowu; WANG Ziniu; GAO Jianling (College of Big Data and Communication Engineering, Guizhou University, Guiyang 550025, China; Network and Information Management Center, Guizhou University, Guiyang 550025, China)
Source
《贵州大学学报(自然科学版)》
2018, No. 1, pp. 78-81 (4 pages)
Journal of Guizhou University: Natural Sciences
Funding
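Supported by the Guizhou Provincial Science and Technology Fund (黔科合J字[2015]2045)
Supported by the Guizhou Provincial Archives Bureau Research Project (2015D001)
Supported by the Guizhou University Graduate Innovation Fund (研理工2017016)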