KNN Text Classification Based on Word Relatedness and Spark Framework (cited by: 3)
Abstract: The K-nearest neighbor (KNN) classification algorithm becomes inefficient and gives unsatisfactory results on today's big data. To address this, we propose an efficient KNN text classification algorithm based on the Spark framework and optimized with word relatedness. In the similarity calculation, word relatedness is used to take the relationships between the words of the texts into account, which optimizes the similarity computation and improves the accuracy of text classification. Spark's in-memory processing mechanism is used to parallelize text classification and so raise the processing efficiency of the KNN algorithm; during parallelization, a class-distance vector is built to further speed up processing. The experimental results show that the word-relatedness-based KNN text classification algorithm on Spark greatly improves classification efficiency while preserving classification quality, achieves a better speedup than the Hadoop platform, and can classify big data effectively.
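The abstract does not give the paper's formulas, so the following is only a rough sketch of the idea, not the authors' method. It assumes documents are term-weight dictionaries, substitutes the well-known soft cosine measure for the relatedness-aware similarity, and reads the class-distance vector as a per-class accumulator over the k nearest neighbours. The `RELATEDNESS` table, its scores, and the function names are all hypothetical, and the scoring step runs locally here where a Spark job would map over distributed training data.

```python
import heapq
import math
from collections import defaultdict

# Hypothetical relatedness table: symmetric scores in [0, 1] for word pairs.
# A real system would derive these from a corpus or a lexical resource.
RELATEDNESS = {
    frozenset(("football", "soccer")): 0.9,
    frozenset(("stock", "market")): 0.7,
}


def relatedness(w1, w2):
    """Return 1.0 for identical words, the table score if present, else 0.0."""
    if w1 == w2:
        return 1.0
    return RELATEDNESS.get(frozenset((w1, w2)), 0.0)


def soft_dot(a, b):
    """Dot product of two term-weight dicts that also credits related words."""
    return sum(wa * wb * relatedness(ta, tb)
               for ta, wa in a.items()
               for tb, wb in b.items())


def soft_cosine(a, b):
    """Soft cosine similarity; 0.0 when either vector is empty."""
    denom = math.sqrt(soft_dot(a, a) * soft_dot(b, b))
    return soft_dot(a, b) / denom if denom > 0 else 0.0


def classify(query, training, k=3):
    """Label `query` by a similarity-weighted vote among its k nearest neighbours."""
    # Assumption: in a Spark job this scoring step would be a map over the
    # distributed training set; here it is a plain local generator.
    scored = ((soft_cosine(query, vec), label) for vec, label in training)
    nearest = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    # Per-class accumulator, standing in for the class-distance vector.
    class_scores = defaultdict(float)
    for sim, label in nearest:
        class_scores[label] += sim
    return max(class_scores, key=class_scores.get)


training = [
    ({"football": 1.0, "goal": 1.0}, "sports"),
    ({"match": 1.0, "goal": 1.0}, "sports"),
    ({"stock": 1.0, "price": 1.0}, "finance"),
    ({"market": 1.0, "price": 1.0}, "finance"),
]
query = {"soccer": 1.0}
print(classify(query, training, k=2))
```

The query shares no term with any training document, so a plain cosine similarity would score every document at zero. Because "soccer" is marked as related to "football", the soft cosine gives the first sports document a non-zero score and the query is labelled as sports.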
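Source: Computer Technology and Development (《计算机技术与发展》), 2018, No. 3, pp. 87-92 (6 pages)
Funding: National Natural Science Foundation of China (61402258); Shandong Province Undergraduate University Teaching Reform Research Project (2015M102); University-level Teaching Reform Research Project (jg05021*)
Keywords: K-nearest neighbor (KNN); word relatedness; Spark; parallel computing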