Abstract
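Many datasets used in text processing have reached terabyte, petabyte, or even larger scale, and traditional single-machine methods struggle to process them effectively. The MapReduce computing framework, which emerged in recent years, addresses the parallel processing of large-scale data with a concise form and a distributed design, and it has been widely recognized and adopted in both academia and industry. MapReduce has already been applied to natural language processing, machine learning, large-scale graph processing, and other fields. This paper first gives a brief introduction to MapReduce and analyzes its characteristics, advantages, and shortcomings; it then classifies and summarizes recent applications of MapReduce in various aspects of text processing; finally, it introduces research on MapReduce systems and performance and offers an outlook.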
With the development of the Internet, the text processing field is challenged to deal with web-scale datasets. It is intractable for traditional approaches to compute effectively on petabyte-scale data volumes. MapReduce emerged to address this issue with distributed and parallel processing methods, and it has been widely recognized and studied in both academia and industry. There have been many successful applications of this technique in natural language processing, machine learning, large-scale graph processing, and statistical machine translation. In this paper, we first give a brief introduction to MapReduce, describing its advantages, limitations, and differences from traditional techniques. We then classify and summarize MapReduce applications in several aspects of text processing. Finally, we introduce research on MapReduce systems and performance and analyze possible future applications of MapReduce.
Source
《中文信息学报》
CSCD
Peking University Core Journals (北大核心)
2012, Issue 4, pp. 9-20 (12 pages)
Journal of Chinese Information Processing
Funding
Supported by a Natural Science Foundation project (61070111)