Research on the TFIDF Algorithm Based on the MapReduce Programming Model

Abstract: With the rapid development of the Internet and related technologies, information processing has become an indispensable tool for obtaining useful information, and retrieving that information efficiently from massive data is critical. Automatic text classification is therefore especially important. Existing text classification algorithms hit bottlenecks in time complexity and space complexity and cannot meet users' needs. To address this, the paper proposes a TFIDF algorithm based on the Hadoop distributed platform, gives the specific procedure for implementing it, and implements it with MapReduce programming. Comparative experiments were run in single-machine and cluster modes, and the algorithm was also compared with the traditional serial algorithm. The experiments show that the TFIDF text classification algorithm can classify massive data quickly and effectively.

Authors: Zhao Weiyan, Wang Jingyu

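Affiliations: School of Information Engineering, Inner Mongolia University of Science and Technology; Information Office and Network Center, Inner Mongolia University of Science and Technology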

Source: Microcomputer & Its Applications (《微型机与应用》), 2013, No. 4, pp. 71-73 (3 pages)

Funding: National Natural Science Foundation of China (61163025); Chunhui Program of the Ministry of Education (Z2009-1-01044)

Keywords: text classification; MapReduce; parallelization; TFIDF algorithm

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
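
The abstract says the TFIDF algorithm was implemented with MapReduce but this record does not reproduce the paper's job design. The following is a minimal single-process Python sketch of one common way to decompose TF-IDF into map and reduce steps; it is not the authors' Hadoop implementation. The function names, the whitespace tokenizer (Chinese text would need a word segmenter), and the weighting tf = count / document length, idf = ln(N / df) are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def map_term_counts(doc_id, text):
    """Map step (hypothetical): emit ((term, doc_id), 1) for each token."""
    # Assumption: whitespace tokenization; real Chinese text needs segmentation.
    for term in text.lower().split():
        yield (term, doc_id), 1


def reduce_by_key(pairs):
    """Shuffle + reduce step: group values by key and sum them."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def tfidf(docs):
    """Compute TF-IDF for {doc_id: text} as three map/reduce passes."""
    # Pass 1: occurrences of each term within each document.
    term_doc_counts = reduce_by_key(
        pair
        for doc_id, text in docs.items()
        for pair in map_term_counts(doc_id, text)
    )
    # Pass 2: total number of tokens in each document.
    doc_lengths = reduce_by_key(
        (doc_id, n) for (_, doc_id), n in term_doc_counts.items()
    )
    # Pass 3: number of documents that contain each term.
    doc_freq = reduce_by_key((term, 1) for (term, _) in term_doc_counts)

    n_docs = len(docs)
    # Final map: combine the three intermediate results into one weight.
    return {
        (term, doc_id): (n / doc_lengths[doc_id])
        * math.log(n_docs / doc_freq[term])
        for (term, doc_id), n in term_doc_counts.items()
    }


if __name__ == "__main__":
    corpus = {"d1": "hadoop mapreduce hadoop", "d2": "mapreduce text"}
    for key, weight in sorted(tfidf(corpus).items()):
        print(key, round(weight, 4))
```

Each pass only needs the key-value output of the previous one, which is why the computation maps naturally onto a chain of MapReduce jobs; on a real cluster each `reduce_by_key` call would correspond to a shuffle and reduce phase distributed across nodes.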
