Abstract
With the rapid growth of information on the Web, Web text categorization has become an active research topic. Traditional approaches largely ignore the substantial noise in web pages during preprocessing, which degrades categorization results to some extent. In addition, the excessively high dimensionality of the vector space model of text also has a considerable effect on categorization performance. This paper presents a Web text categorization method based on rough set theory: web pages are first denoised, attribute reduction is then applied to the vector space model, and finally a classifier is constructed. Experiments show that the method both reduces the dimensionality and improves categorization results.
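The attribute reduction step can be illustrated with a minimal sketch. The abstract does not state which reduction algorithm the paper uses, so the code below is an assumption: a common greedy heuristic that keeps adding the term that most increases the rough set dependency degree (the share of documents in the positive region) until it matches the dependency of the full term set. The names `partition`, `dependency`, and `greedy_reduct` and the toy table `TERMS`, `ROWS`, `LABELS` are hypothetical and not taken from the paper.

```python
from collections import defaultdict

# Toy decision table (illustrative data, not from the paper):
# each row marks which terms occur in one document.
TERMS = ["goal", "match", "stock", "the"]
ROWS = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 1, 1, 1],
]
LABELS = ["sports", "sports", "sports", "finance", "finance", "finance"]


def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def dependency(rows, labels, attrs):
    """Fraction of rows lying in blocks whose rows all share one label."""
    positive = 0
    for block in partition(rows, attrs):
        if len({labels[i] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)


def greedy_reduct(rows, labels):
    """Greedily pick attributes until dependency equals that of all attributes."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    chosen = []
    while dependency(rows, labels, chosen) < target:
        best = max(
            (a for a in all_attrs if a not in chosen),
            key=lambda a: dependency(rows, labels, chosen + [a]),
        )
        chosen.append(best)
    return sorted(chosen)


if __name__ == "__main__":
    kept = greedy_reduct(ROWS, LABELS)
    print([TERMS[a] for a in kept])
```

On this toy table the heuristic keeps only the "stock" column, since that term alone separates the two classes. A greedy search of this kind returns a set of attributes that preserves the dependency degree but is not guaranteed to be a minimal reduct.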
Source
《计算机应用与软件》
CSCD
2009, No. 8, pp. 153-155, 170 (4 pages in total)
Computer Applications and Software
Keywords
Text categorization
Noise
Vector space model
Rough set