Abstract
With the rapid development of the information society, network data are growing exponentially, and most of these data are text. Completing the mining and analysis of large-scale text data within a limited time has become an active research topic. Text preprocessing is the most time-consuming step in the whole mining process, and distributed parallel processing can shorten it. A MapReduce parallelization of text preprocessing is designed and analyzed on the Hadoop cloud computing platform, and the Map and Reduce functions used for preprocessing are described in detail. Experimental results show that the parallelized method performs better than single-node execution.
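The abstract does not give the paper's actual Map and Reduce functions, so the following is only a minimal sketch of how text preprocessing might be split into a map step and a reduce step. It assumes that preprocessing means tokenization, stop-word removal, and per-document term counting. The names `map_preprocess`, `reduce_count`, and `run_job`, and the stop-word list, are hypothetical. The shuffle phase that Hadoop would perform across nodes is simulated locally with a dictionary.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list, for illustration only (not from the paper).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}


def map_preprocess(doc_id, text):
    """Map step: tokenize one document, drop stop words,
    and emit ((doc_id, term), 1) pairs."""
    for token in re.findall(r"[a-z]+", text.lower()):
        if token not in STOP_WORDS:
            yield (doc_id, token), 1


def reduce_count(key, values):
    """Reduce step: sum the counts collected for one (doc_id, term) key."""
    return key, sum(values)


def run_job(documents):
    """Simulate the shuffle locally: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_preprocess(doc_id, text):
            grouped[key].append(value)
    return dict(reduce_count(key, values) for key, values in grouped.items())


docs = {
    "d1": "The growth of text data is rapid",
    "d2": "Text mining of text data",
}
term_counts = run_job(docs)
```

Because each document is mapped independently, the map step can in principle be spread over many nodes, which is the property the abstract relies on for its speedup claim.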
Author
张爱科
ZHANG Aike (Liuzhou Vocational and Technical College, Liuzhou 545006, China)
Source
《上海工程技术大学学报》
CAS
2017, No. 2, pp. 115-119 (5 pages)
Journal of Shanghai University of Engineering Science
Funding
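Scientific Research Funding Project of the Guangxi Department of Education (201204LX593)
Guangxi Funding Project for Improving the Basic Abilities of Young and Middle-aged Teachers (KY2016LX516)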