Abstract
To address the declining efficiency and unsatisfactory accuracy of the K-nearest neighbor (KNN) classification algorithm on large-scale data, we propose an efficient KNN text classification algorithm based on the Spark framework and optimized with word relatedness. In the similarity calculation, word relatedness is used to take the relationships between the words of a text into account, which optimizes the similarity computation and improves classification accuracy. The algorithm relies on Spark's in-memory processing mechanism to parallelize text classification and thereby improve the processing efficiency of KNN text classification; during parallelization, a class-distance vector is built to further speed up processing. Experimental results show that the word-relatedness-based KNN text classification algorithm on Spark greatly improves classification efficiency while preserving classification quality, achieves a better speedup than the Hadoop platform, and can classify big data effectively.
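The abstract does not give the formulas, so the following is only a minimal single-machine sketch of the general idea, under stated assumptions: similarity is a cosine measure in which pairs of different words contribute in proportion to a relatedness score, and the "class-distance vector" is read here as a per-class accumulation of the similarities of the k nearest neighbors. The relatedness table, the function names, and the toy data are all hypothetical and are not taken from the paper. In a Spark setting, the per-training-document similarity step would presumably be distributed as a map over the training set.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical word-relatedness table (symmetric, values in [0, 1]).
# The abstract does not say how relatedness is obtained; this is an assumption.
RELATEDNESS = {
    ("football", "soccer"): 0.9,
    ("stock", "market"): 0.7,
}

def relatedness(a, b):
    """Return 1.0 for identical words, a table value for related words, else 0.0."""
    if a == b:
        return 1.0
    return RELATEDNESS.get((a, b), RELATEDNESS.get((b, a), 0.0))

def soft_dot(u, v):
    """Dot product in which every word pair is weighted by its relatedness."""
    return sum(wu * wv * relatedness(a, b)
               for a, wu in u.items() for b, wv in v.items())

def similarity(u, v):
    """Relatedness-aware cosine similarity between two term-weight dicts."""
    denom = sqrt(soft_dot(u, u)) * sqrt(soft_dot(v, v))
    return soft_dot(u, v) / denom if denom else 0.0

def classify(doc, training, k=3):
    """Label doc by the class with the largest summed similarity among the k nearest."""
    scored = sorted(((similarity(doc, vec), label) for vec, label in training),
                    reverse=True)
    class_scores = defaultdict(float)  # assumed stand-in for a class-distance vector
    for sim, label in scored[:k]:
        class_scores[label] += sim
    return max(class_scores, key=class_scores.get)

# Toy data (hypothetical): term-weight dicts paired with class labels.
TRAINING = [
    ({"soccer": 1.0, "goal": 1.0}, "sports"),
    ({"match": 1.0, "goal": 1.0}, "sports"),
    ({"stock": 1.0, "price": 1.0}, "finance"),
    ({"market": 1.0, "bank": 1.0}, "finance"),
]
DOC = {"football": 1.0, "goal": 1.0}
```

With this toy data, `similarity(DOC, TRAINING[0][0])` is higher than a plain cosine would give, because "football" and "soccer" are treated as related even though they are different tokens; `classify(DOC, TRAINING)` then votes over the accumulated per-class scores.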
Source
《计算机技术与发展》
2018, Issue 3, pp. 87-92 (6 pages)
Computer Technology and Development
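Funding
National Natural Science Foundation of China (61402258)
Shandong Province Undergraduate University Teaching Reform Research Project (2015M102)
University-level Teaching Reform Research Project (jg05021*)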