
An Improved Text Input Method for Parallel Word Segmentation Based on MapReduce
Abstract  Existing Chinese word segmentation methods are serial and cannot handle massive data. We propose a parallel word segmentation method based on MapReduce. The MapReduce programming model uses the TextInputFormat input class by default, which is ill-suited to datasets made up of many small text files. First, we define a custom input class, MyInputFormat, based on the CombineFileInputFormat class, whose createRecordReader method returns a RecordReader object. Second, we define a MyRecordReader class that specifies the logic for reading the input text into 〈key, value〉 pairs. Finally, we define our own map and reduce functions to produce segmentation results for the different text categories. Experiments show that the improved MyInputFormat input method handles large numbers of small text files better than the default TextInputFormat and saves considerable segmentation time.
Authors  XU Hong-bo, ZHAO Wen-tao, MENG Ling-jun (College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo 454000, China)
Source  Computer Knowledge and Technology (《电脑知识与技术》), 2016, Issue 8, pp. 171-175.
Keywords  MapReduce; split; TextInputFormat; CombineFileInputFormat
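The paper's approach is implemented in Java on Hadoop. As an illustration only, the core idea behind CombineFileInputFormat that the abstract relies on (packing many small files into a few large input splits, where TextInputFormat would create one split, and hence one map task, per file) can be sketched in Python. The function and file names below are hypothetical and are not the authors' code:

```python
# Hypothetical sketch of the split-combining idea behind Hadoop's
# CombineFileInputFormat: group many small files into a few splits,
# instead of one split per file as TextInputFormat would produce.

def combine_splits(file_sizes, max_split_bytes):
    """Greedily group files so each split stays within max_split_bytes.

    file_sizes: list of (filename, size_in_bytes) pairs.
    Returns a list of splits, each a list of filenames.
    """
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Start a new split once adding this file would exceed the cap
        # (a single file larger than the cap gets a split of its own).
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits

# Ten 2 MB files under an 8 MB split cap yield 3 splits (3 map tasks)
# rather than the 10 map tasks a one-split-per-file scheme would launch.
files = [(f"doc{i}.txt", 2_000_000) for i in range(10)]
print(combine_splits(files, max_split_bytes=8_000_000))
```

In the paper's actual pipeline, MyInputFormat plays this grouping role over files in HDFS, and MyRecordReader then walks each combined split to emit the 〈key, value〉 pairs consumed by the map function.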