
Inductive Model Generation for Text Classification Using a Bipartite Heterogeneous Network (cited by: 1)

Abstract: Algorithms for numeric data classification have been applied to text classification. Usually the vector space model is used to represent text collections. The characteristics of this representation, such as sparsity and high dimensionality, sometimes impair the quality of general-purpose classifiers. Networks can be used to represent text collections instead, avoiding the high sparsity and making it possible to model relationships among the different objects that compose a text collection. Such network-based representations can improve the quality of the classification results. One of the simplest ways to represent a text collection as a network is through a bipartite heterogeneous network, which is composed of objects that represent the documents connected to objects that represent the terms. Bipartite heterogeneous networks do not require the computation of similarities or relations among the objects and can be used to model any type of text collection. Given these advantages, in this article we present a text classifier that builds a classification model using the structure of a bipartite heterogeneous network. The algorithm, referred to as IMBHN (Inductive Model Based on Bipartite Heterogeneous Network), induces a classification model by assigning weights to the objects that represent the terms for each class of the text collection. An empirical evaluation using a large number of text collections from different domains shows that the proposed IMBHN algorithm produces significantly better results than the k-NN, C4.5, SVM, and Naive Bayes algorithms.
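The idea in the abstract, a document-term bipartite network in which each term object carries one weight per class, can be illustrated with a minimal Python sketch. The abstract does not specify how IMBHN induces the weights, so the least-mean-squares style update below is an assumption made only for illustration, and every name in it (`build_bipartite_network`, `induce_term_weights`, `classify`, `lr`, `epochs`) is hypothetical rather than taken from the article.

```python
from collections import Counter


def build_bipartite_network(docs):
    """Return document-term edges: doc_id -> {term: term frequency}.

    Documents and terms are the two node types; an edge exists only
    between a document and a term it contains, so no document-document
    or term-term similarity is ever computed.
    """
    return {doc_id: dict(Counter(tokens)) for doc_id, tokens in docs.items()}


def induce_term_weights(edges, labels, classes, lr=0.1, epochs=50):
    """Assign each term one weight per class.

    ASSUMPTION: a simple least-mean-squares error-correction rule is used
    here; the abstract does not state the actual IMBHN update rule.
    """
    terms = {t for doc_terms in edges.values() for t in doc_terms}
    weights = {t: {c: 0.0 for c in classes} for t in terms}
    for _ in range(epochs):
        for doc_id, doc_terms in edges.items():
            for c in classes:
                target = 1.0 if labels[doc_id] == c else 0.0
                # Class score of a document = sum over its term edges.
                score = sum(f * weights[t][c] for t, f in doc_terms.items())
                error = target - score
                for t, f in doc_terms.items():
                    weights[t][c] += lr * f * error
    return weights


def classify(tokens, weights, classes):
    """Label a new document using only the induced term weights."""
    counts = Counter(tokens)
    scores = {
        c: sum(f * weights[t][c] for t, f in counts.items() if t in weights)
        for c in classes
    }
    return max(classes, key=lambda c: scores[c])


# Toy data, invented for this example only.
docs = {
    "d1": ["goal", "match", "team"],
    "d2": ["team", "score", "goal"],
    "d3": ["stock", "market", "price"],
    "d4": ["market", "trade", "stock"],
}
labels = {"d1": "sports", "d2": "sports", "d3": "finance", "d4": "finance"}
classes = ["sports", "finance"]

network = build_bipartite_network(docs)
term_weights = induce_term_weights(network, labels, classes)
print(classify(["goal", "team"], term_weights, classes))
```

Because the model consists only of per-class term weights, a new document can be classified without access to the training documents, which is what makes the model inductive in this sketch.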
Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), indexed in SCIE, EI, and CSCD; 2014, Issue 3, pp. 361-375 (15 pages)
Funding: Supported by the São Paulo Research Foundation (FAPESP) of Brazil under Grant Nos. 2011/12823-6, 2011/23689-9, and 2011/19850-9
Keywords: heterogeneous network, text classification, inductive model generation

