Abstract
Text categorization is a research hotspot and a core technique in information retrieval and data mining, and it has received wide attention and developed rapidly in recent years. As text data grows exponentially, managing it effectively requires efficient algorithms that run in a distributed environment. This paper implements a simple and effective text categorization algorithm, the TFIDF classifier, on the Hadoop distributed platform. The classifier is based on the vector space model and uses cosine similarity to obtain the classification result. Experiments on two datasets show that the parallel algorithm is effective on large datasets and can be applied well in practical settings.
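The approach named in the abstract, TF-IDF vectors compared by cosine similarity, can be sketched on a single machine as follows. This is only an illustration under stated assumptions, not the paper's Hadoop implementation: the abstract does not give the weighting formula or the MapReduce decomposition, so the smoothed IDF, the per-class summed prototype vectors, and the function names `train`, `tfidf`, `cosine`, and `classify` are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict


def tfidf(tokens, idf):
    """Map a token list to a sparse TF-IDF vector (dict: term -> weight).

    Terms never seen in training have no IDF and are ignored.
    """
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def train(docs):
    """Build IDF weights and one prototype vector per class.

    docs: list of (label, tokens) pairs.
    Assumption: smoothed IDF and summed class prototypes; the abstract
    does not specify either detail.
    """
    n_docs = len(docs)
    df = Counter()
    for _, tokens in docs:
        df.update(set(tokens))  # document frequency counts each doc once
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    prototypes = defaultdict(Counter)
    for label, tokens in docs:
        for term, weight in tfidf(tokens, idf).items():
            prototypes[label][term] += weight
    return idf, dict(prototypes)


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(tokens, idf, prototypes):
    """Return the label whose prototype is most similar to the document."""
    vec = tfidf(tokens, idf)
    return max(prototypes, key=lambda label: cosine(vec, prototypes[label]))
```

In a distributed setting, the document-frequency counts and the per-class sums are both additive, which is what makes this kind of classifier a natural fit for a map and reduce style of computation; how the paper actually splits the work is not stated in the abstract.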
Source
《计算机科学》
CSCD
Peking University Core Journal (北大核心)
2011, No. 10, pp. 184-188 (5 pages)
Computer Science
Funding
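National Natural Science Foundation of China (61035003, 60875011)
International Science and Technology Cooperation Program of the Ministry of Science and Technology (2010DFA11030)
Natural Science Foundation of Jiangsu Province (BK2010054)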