A Web Text Categorization Method Based on Rough Sets

ON WEB TEXT CATEGORIZATION BASED ON ROUGH SET THEORY
Abstract  With the rapid growth of Web information, Web text categorization has become an active research topic. Traditional Web text categorization largely ignores the heavy noise in Web pages during preprocessing, which affects the categorization results to some extent. In addition, the excessively high dimensionality of the vector space model of text also degrades categorization performance considerably. This paper presents a Web text categorization method based on rough set theory: the Web pages are first denoised, attribute reduction is then applied to the vector space model of the Web text, and finally the classifier is constructed. Experiments show that the method both reduces the dimensionality and improves the categorization results.
Source  Computer Applications and Software (《计算机应用与软件》), CSCD, 2009, No. 8, pp. 153-155, 170 (4 pages)
Keywords  Text categorization; Noise; Vector space model; Rough set
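
The attribute-reduction step named in the abstract can be illustrated with a small sketch. This record does not state which reduction algorithm the paper uses, so the code below is only an assumption: a generic greedy, QuickReduct-style selection driven by the rough-set dependency degree, applied to a toy binary term-document table. The function names, the term list, and the sample documents are all hypothetical.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def dependency(rows, labels, attrs):
    """Dependency degree: share of rows whose equivalence class has one label."""
    positive = 0
    for block in partition(rows, attrs):
        if len({labels[i] for i in block}) == 1:
            positive += len(block)  # block lies in the positive region
    return positive / len(rows)


def quick_reduct(rows, labels):
    """Greedily add attributes until the full-set dependency is reached."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    reduct = []
    while dependency(rows, labels, reduct) < target:
        best = max(
            (a for a in all_attrs if a not in reduct),
            key=lambda a: dependency(rows, labels, reduct + [a]),
        )
        reduct.append(best)
    return reduct


# Hypothetical toy data: 1 means the term occurs in the (denoised) page.
terms = ["goal", "match", "stock", "market", "the"]
rows = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
]
labels = ["sports", "sports", "finance", "finance", "sports", "finance"]

reduct = quick_reduct(rows, labels)
print([terms[a] for a in reduct])
```

In this toy table a single term already separates the two classes, so the reduced vector space keeps one dimension out of five. A classifier would then be trained on the reduced columns only. Greedy selection gives a reduct-like subset quickly but does not guarantee a minimal one.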