
Research on an Efficient Web Text Classification System Based on Spark (cited by: 7)
Abstract: The KNN classification algorithm trains and tests inefficiently on a single machine when it has to process massive volumes of Web text. To address this problem, we propose an improved Web text classification system that produces no intermediate result output and is built on the Hadoop distributed platform and the Spark parallel computing model. To take full advantage of Spark's iterative computing capability, we also propose an improved feature weighting algorithm for the text vectorization stage. It extends the traditional TFIDF text feature weighting algorithm by taking into account how feature terms are distributed within classes and between classes. Experimental results show that the system, combined with the Spark computing model, delivers excellent performance gains in text preprocessing, text vectorization, and KNN text classification.
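To make the pipeline named in the abstract concrete, here is a minimal single-machine sketch of text vectorization with baseline TFIDF weights followed by KNN classification using cosine similarity. It is not the authors' method: the abstract does not give the formula of the improved weighting (which uses within-class and between-class term distributions), so the sketch uses plain TFIDF with a smoothed IDF. The function names (`idf_table`, `vectorize`, `cosine`, `knn_classify`), the smoothing, and the choice of cosine similarity are all assumptions made for illustration. On Spark, the per-document vectorization and similarity steps would presumably be distributed across the cluster, and that part is not shown here.

```python
import math
from collections import Counter


def idf_table(train_docs):
    """Smoothed inverse document frequency per term (smoothing is an assumption)."""
    n = len(train_docs)
    # Document frequency: number of training documents containing each term.
    df = Counter(term for doc in train_docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}


def vectorize(doc, idf):
    """Map a tokenized document to a sparse {term: tf * idf} vector."""
    tf = Counter(term for term in doc if term in idf)  # ignore unseen terms
    total = sum(tf.values())
    if total == 0:
        return {}
    return {term: (count / total) * idf[term] for term, count in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, train_docs, train_labels, k=3):
    """Label a tokenized query by majority vote among its k most similar documents."""
    idf = idf_table(train_docs)
    train_vecs = [vectorize(doc, idf) for doc in train_docs]
    query_vec = vectorize(query, idf)
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query_vec, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Each call to `knn_classify` compares the query against every training document. That linear cost per query is the single-machine bottleneck the abstract refers to, and it is the step a parallel implementation would spread across workers.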
Authors: Li Tao (李涛), Liu Bin (刘斌)
Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2016, No. 11, pp. 33-36 (4 pages)
Keywords: KNN; TFIDF; text classification; Hadoop; Spark