Abstract
With the rapid development of the Internet and related technologies, information processing has become an indispensable tool for obtaining useful information, and retrieving that information efficiently from massive data is critical; automatic text classification is therefore especially important. Existing text classification algorithms hit bottlenecks in time complexity and space complexity and cannot meet users' needs. To address this, the paper proposes a TFIDF algorithm based on the Hadoop distributed platform, describes the detailed workflow of the algorithm, and implements it with MapReduce programming. Comparative experiments are carried out in single-node and cluster modes, and the algorithm is also compared with the traditional serial algorithm. The experiments show that the TFIDF text classification algorithm achieves fast and effective classification of massive data.
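The abstract does not give the paper's MapReduce job layout, so the following is only a minimal in-memory sketch of how TF-IDF weights can be computed as a chain of map and reduce steps. It assumes the common definition tf = (term count in document) / (document length) and idf = ln(N / document frequency); the function names `map_term_counts`, `reduce_sum`, and `tfidf` are hypothetical and are not taken from the paper or from the Hadoop API.

```python
import math
from collections import defaultdict


def map_term_counts(doc_id, text):
    """Map step (assumed layout): emit ((term, doc_id), 1) for every token."""
    for term in text.lower().split():
        yield (term, doc_id), 1


def reduce_sum(pairs):
    """Reduce step: sum all values that share the same key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def tfidf(docs):
    """Compute TF-IDF weights for a dict {doc_id: text} in three passes."""
    # Pass 1: count occurrences of each term in each document.
    counts = reduce_sum(
        pair
        for doc_id, text in docs.items()
        for pair in map_term_counts(doc_id, text)
    )
    # Pass 2: total number of tokens per document.
    doc_len = reduce_sum((doc_id, n) for (term, doc_id), n in counts.items())
    # Pass 3: number of documents that contain each term.
    doc_freq = reduce_sum((term, 1) for (term, doc_id) in counts)
    n_docs = len(docs)
    # Final step: combine term frequency and inverse document frequency.
    return {
        (term, doc_id): (n / doc_len[doc_id]) * math.log(n_docs / doc_freq[term])
        for (term, doc_id), n in counts.items()
    }


if __name__ == "__main__":
    sample = {"d1": "hadoop mapreduce hadoop", "d2": "mapreduce text"}
    for key, weight in sorted(tfidf(sample).items()):
        print(key, round(weight, 4))
```

In this sketch a term that appears in every document gets weight zero, which is why such terms contribute nothing to separating classes. On a real cluster each pass would be a separate MapReduce job over distributed input rather than a Python generator.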
Source
《微型机与应用》
2013, No. 4, pp. 71-73 (3 pages)
Microcomputer & Its Applications
Funding
National Natural Science Foundation of China (61163025)
Chunhui Program of the Ministry of Education (Z2009-1-01044)