Parallel Text Categorization of Massive Text Based on Hadoop (基于Hadoop平台的海量文本分类的并行化)
Cited by: 35
Abstract: Text categorization is a research focus and core technique in information retrieval and data mining, and it has received wide attention and developed rapidly in recent years. As text data grows exponentially, managing it effectively requires efficient algorithms that run in a distributed environment. This paper implements a simple and effective text categorization algorithm, the TFIDF classifier, on the Hadoop distributed platform. The classifier is based on the vector space model and uses cosine similarity to produce the classification result. Experiments on two datasets show that the parallel algorithm is effective on large datasets and can be applied in practice.
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2011, No. 10, pp. 184-188 (5 pages)
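Funding: Supported by the National Natural Science Foundation of China (61035003, 60875011), the International Science and Technology Cooperation Program of the Ministry of Science and Technology (2010DFA11030), and the Natural Science Foundation of Jiangsu Province (BK2010054)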
Keywords: text categorization; parallelization; massive data; Hadoop
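
The abstract describes a TFIDF classifier built on the vector space model, with cosine similarity deciding the class. The record does not give the paper's formulas or its MapReduce job layout, so the following is only a minimal single-process sketch under stated assumptions: one prototype vector per class formed by summing term counts, idf = log(N / df), whitespace tokenization, and hypothetical function names (`map_doc`, `reduce_by_class`, `train`, `classify`). The map and reduce steps are plain Python functions standing in for Hadoop tasks.

```python
import math
from collections import Counter, defaultdict


def map_doc(label, text):
    """Map step (assumed): emit (label, term counts) for one document."""
    return label, Counter(text.lower().split())


def reduce_by_class(mapped):
    """Reduce step (assumed): sum term counts per class, count document frequencies."""
    class_tf = defaultdict(Counter)
    doc_freq = Counter()
    n_docs = 0
    for label, tf in mapped:
        class_tf[label].update(tf)      # aggregate term counts for the class
        doc_freq.update(tf.keys())      # each document counts once per term
        n_docs += 1
    return class_tf, doc_freq, n_docs


def tfidf(tf, doc_freq, n_docs):
    """Weight counts by idf = log(N / df); terms unseen in training are dropped."""
    return {t: c * math.log(n_docs / doc_freq[t])
            for t, c in tf.items() if doc_freq.get(t)}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def train(labeled_docs):
    """Build one TF-IDF prototype vector per class."""
    mapped = [map_doc(label, text) for label, text in labeled_docs]
    class_tf, doc_freq, n_docs = reduce_by_class(mapped)
    prototypes = {label: tfidf(tf, doc_freq, n_docs)
                  for label, tf in class_tf.items()}
    return prototypes, doc_freq, n_docs


def classify(text, model):
    """Return the class whose prototype is most cosine-similar to the document."""
    prototypes, doc_freq, n_docs = model
    _, tf = map_doc(None, text)
    vec = tfidf(tf, doc_freq, n_docs)
    return max(prototypes, key=lambda label: cosine(vec, prototypes[label]))


# Toy training data, invented purely for illustration.
model = train([
    ("sports", "football match goal team"),
    ("sports", "team wins the match"),
    ("tech", "hadoop cluster mapreduce job"),
    ("tech", "the cluster runs a job"),
])
```

Calling `classify("mapreduce job on hadoop", model)` picks the class with the closest prototype. In a real Hadoop job the per-document map output would be shuffled by class key before the reduce step, which this sketch collapses into a single loop.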

