
Preprocessing Method of Text Mining Based on Hadoop Platform (cited by: 1)
Abstract: With the rapid development of the information society, network data is growing exponentially, and most of it is text. Completing the mining and analysis of large-scale text data within a limited time has become an active research topic. Text preprocessing is the most time-consuming stage of the whole mining process, and distributed parallel processing can shorten it. This paper designs and analyzes a MapReduce parallelization of text preprocessing on the Hadoop cloud computing platform and describes the preprocessing Map and Reduce functions in detail. Experiments show that the improved parallel method performs better than single-node execution.
Author: ZHANG Aike (张爱科), Liuzhou Vocational and Technical College, Liuzhou 545006, China
Source: Journal of Shanghai University of Engineering Science (《上海工程技术大学学报》), CAS, 2017, No. 2, pp. 115-119 (5 pages)
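Funding: Scientific Research Project of the Guangxi Department of Education (201204LX593); Guangxi Project for Improving the Basic Abilities of Young and Middle-aged Teachers (KY2016LX516)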
Keywords: cloud computing; Hadoop platform; text mining; text preprocessing; distributed parallel processing
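
The abstract mentions a Map function and a Reduce function for text preprocessing but does not give their details. The following is a minimal sketch of how such a pair might look, simulated in plain Python rather than run on Hadoop. It is not the author's implementation: the stop-word list, the regex tokenizer (a stand-in for Chinese word segmentation), and the names `map_preprocess`, `reduce_count`, and `run_job` are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real job would load a full list.
STOP_WORDS = {"the", "a", "of", "and", "is", "in"}


def map_preprocess(doc_id, text):
    """Map step: tokenize one document, drop stop words,
    and emit ((doc_id, term), 1) for every remaining token."""
    for token in re.findall(r"[a-z]+", text.lower()):
        if token not in STOP_WORDS:
            yield (doc_id, token), 1


def reduce_count(key, values):
    """Reduce step: sum the counts collected for one (doc_id, term) key."""
    return key, sum(values)


def run_job(documents):
    """Simulate the shuffle phase locally: group map output by key,
    then apply the reduce step to each group."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_preprocess(doc_id, text):
            groups[key].append(value)
    return dict(reduce_count(k, v) for k, v in sorted(groups.items()))


docs = {
    "d1": "The mining of text data and the text",
    "d2": "Text preprocessing is slow",
}
result = run_job(docs)
for (doc_id, term), count in result.items():
    print(doc_id, term, count)
```

Because each document is mapped independently, the map calls could be spread over many nodes, which is the property the abstract relies on to reduce preprocessing time.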
