Inductive Model Generation for Text Classification Using a Bipartite Heterogeneous Network (cited by: 1)

Abstract: Algorithms for numeric data classification have been applied to text classification. Usually the vector space model is used to represent text collections. The characteristics of this representation, such as sparsity and high dimensionality, sometimes impair the quality of general-purpose classifiers. Networks can be used to represent text collections, avoiding the high sparsity and allowing relationships among the different objects that compose a text collection to be modeled. Such network-based representations can improve the quality of the classification results. One of the simplest ways to represent a text collection as a network is a bipartite heterogeneous network, which is composed of objects that represent the documents connected to objects that represent the terms. Bipartite heterogeneous networks do not require the computation of similarities or relations among the objects and can be used to model any type of text collection. Given these advantages, in this article we present a text classifier that builds a classification model using the structure of a bipartite heterogeneous network. The algorithm, referred to as IMBHN (Inductive Model Based on Bipartite Heterogeneous Network), induces a classification model by assigning weights to the objects that represent the terms for each class of the text collection. An empirical evaluation using a large number of text collections from different domains shows that the proposed IMBHN algorithm produces significantly better results than the k-NN, C4.5, SVM, and Naive Bayes algorithms.
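The idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state how the term weights are induced, so the least-mean-squares style update below, the learning rate, the epoch count, the whitespace tokenizer, and all function names (build_bipartite_network, induce_term_weights, classify) are assumptions made for illustration only.

```python
from collections import defaultdict


def build_bipartite_network(docs):
    """Link each document object to its term objects; edge weight = term frequency."""
    edges = {}
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for term in text.lower().split():  # naive whitespace tokenizer (assumption)
            counts[term] += 1
        edges[doc_id] = dict(counts)
    return edges


def induce_term_weights(edges, labels, classes, rate=0.1, epochs=50):
    """Assign one weight per (term, class) pair.

    Assumed rule (not taken from the abstract): a least-mean-squares update
    that moves each document's class score toward 1 for its true class
    and toward 0 for every other class.
    """
    weights = defaultdict(lambda: {c: 0.0 for c in classes})
    for _ in range(epochs):
        for doc_id, terms in edges.items():
            for c in classes:
                score = sum(freq * weights[t][c] for t, freq in terms.items())
                target = 1.0 if labels[doc_id] == c else 0.0
                error = target - score
                for t, freq in terms.items():
                    weights[t][c] += rate * freq * error
    return weights


def classify(text, weights, classes):
    """Score an unseen document by summing the class weights of its known terms."""
    scores = {c: 0.0 for c in classes}
    for term in text.lower().split():
        if term in weights:  # terms never seen in training are ignored
            for c in classes:
                scores[c] += weights[term][c]
    return max(scores, key=scores.get)


if __name__ == "__main__":
    docs = {
        "d1": "goal match team",
        "d2": "team win goal",
        "d3": "stock market price",
        "d4": "market price fall",
    }
    labels = {"d1": "sports", "d2": "sports", "d3": "finance", "d4": "finance"}
    classes = ["sports", "finance"]
    model = induce_term_weights(build_bipartite_network(docs), labels, classes)
    print(classify("goal team", model, classes))
```

Because the model consists only of per-class term weights, a new document can be classified without adding it to the network, which is the sense in which such a model is inductive. No similarity between documents is ever computed; the only inputs are the document-term edges.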

Authors: Rafael Geraldeli Rossi, Alneu de Andrade Lopes, Thiago de Paulo Faleiros, Solange Oliveira Rezende

Affiliation: Institute of Mathematics and Computer Science

Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition); indexed in SCIE, EI, CSCD; 2014, Issue 3, pp. 361-375 (15 pages)

Funding: Supported by the São Paulo Research Foundation (FAPESP) of Brazil under Grant Nos. 2011/12823-6, 2011/23689-9, and 2011/19850-9

Keywords: heterogeneous network, text classification, inductive model generation

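Classification codes: TP391.1 [Automation and Computer Technology: Computer Application Technology]; TP393 [Automation and Computer Technology: Computer Application Technology]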


