Abstract
Automatic Web text categorization is a research hotspot and a core technique in information retrieval and data mining, and it has received wide attention and developed rapidly in recent years. This paper first reviews the state of research on Web text categorization methods in China and abroad. It then examines in depth several recently proposed approaches, namely multi-classifier fusion, swarm-based classification, text categorization models based on RBF networks, text categorization models based on fuzzy-rough sets, and latent semantic classification models, together with recent developments in the K-nearest neighbor (K-NN) algorithm and the support vector machine (SVM). It also analyzes several key techniques in the Web text categorization process: text preprocessing, text representation, feature dimensionality reduction, training methods, and classification algorithms. Finally, it summarizes the open problems and development trends of current Web text categorization technology.
Source
《情报学报》
CSSCI
Peking University Core Journals (北大核心)
2009, No. 2, pp. 233-241 (9 pages)
Journal of the China Society for Scientific and Technical Information
Keywords
text categorization
categorization method
text representation
feature selection