A fuzzy method to learn text classifier from labeled and unlabeled examples

Abstract: In text classification, labeling documents is tedious and costly because it consumes a great deal of expert time. Unlabeled documents, by contrast, are usually easy to obtain in large numbers with tools such as digital libraries, crawler programs, and search engines. A novel fuzzy method is proposed to learn a text classifier from labeled and unlabeled examples. First, a Seeded Fuzzy c-means Clustering algorithm is proposed to learn fuzzy clusters from a set of labeled and unlabeled examples. Second, based on the resulting fuzzy clusters, examples with high confidence are selected to construct a training data set. Finally, the constructed training data set is used to train a Fuzzy Support Vector Machine (FSVM), which yields the text classifier. Empirical results on two benchmark datasets indicate that, by incorporating unlabeled examples into the learning process, the method performs significantly better than an FSVM trained on only a small number of labeled examples. The proposed method also performs at least as well as the related method, EM with Naive Bayes. One advantage of the proposed method is that it does not rely on any parametric assumptions about the data, as is usually the case with the generative methods widely used in semi-supervised learning.
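The first two steps of the abstract (seeded fuzzy clustering, then selection of high-confidence examples) can be illustrated with a minimal sketch. The abstract does not give the algorithm's details, so everything below is an assumption made for illustration: the standard fuzzy c-means update rules, the choice to initialise each centre from the labeled seeds and to clamp the seeds' memberships, the fuzzifier `m`, the `threshold` value, and the function names `seeded_fuzzy_c_means` and `select_confident`. It should not be read as the authors' actual method.

```python
import math


def seeded_fuzzy_c_means(points, seed_labels, n_clusters, m=2.0, n_iter=50):
    """Fuzzy c-means with labeled seeds (illustrative assumption, not the paper's algorithm).

    points      -- list of equal-length feature vectors
    seed_labels -- dict mapping point index -> class id for the labeled examples;
                   every class needs at least one seed
    Returns (memberships, centers).
    """
    dim = len(points[0])
    # Assumption: each centre starts at the mean of the seeds of its class.
    centers = []
    for c in range(n_clusters):
        seeds = [points[i] for i, lab in seed_labels.items() if lab == c]
        centers.append([sum(p[d] for p in seeds) / len(seeds) for d in range(dim)])

    memberships = []
    for _ in range(n_iter):
        memberships = []
        for i, p in enumerate(points):
            if i in seed_labels:
                # Assumption: labeled examples keep full membership in their class.
                row = [1.0 if c == seed_labels[i] else 0.0 for c in range(n_clusters)]
            else:
                # Standard fuzzy c-means membership update.
                dists = [max(math.dist(p, ctr), 1e-12) for ctr in centers]
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                total = sum(inv)
                row = [v / total for v in inv]
            memberships.append(row)
        # Standard centre update: mean of all points weighted by membership ** m.
        for c in range(n_clusters):
            weights = [row[c] ** m for row in memberships]
            total = sum(weights)
            centers[c] = [
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ]
    return memberships, centers


def select_confident(memberships, threshold=0.8):
    """Keep (index, cluster, membership) for examples whose top membership >= threshold."""
    selected = []
    for i, row in enumerate(memberships):
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] >= threshold:
            selected.append((i, best, row[best]))
    return selected


# Toy data: two tight groups plus one ambiguous point halfway between them.
demo_points = [
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 4.9], [4.8, 5.2],
    [2.5, 2.5],
]
demo_seeds = {0: 0, 3: 1}  # one labeled example per class
demo_memberships, _ = seeded_fuzzy_c_means(demo_points, demo_seeds, n_clusters=2)
for index, cluster, degree in select_confident(demo_memberships):
    print(index, cluster, round(degree, 3))
```

In this toy run the ambiguous midpoint falls below the threshold and is left out of the constructed training set. For the third step, the retained membership degree could plausibly serve as the per-example weight that a fuzzy SVM expects, although the abstract does not say how the FSVM is weighted.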

Authors: Liu Hong (刘宏), Huang Shangteng (黄上腾)

Affiliation: Dept. of Computer Science

Source: Journal of Harbin Institute of Technology (New Series) (哈尔滨工业大学学报（英文版）), 2004, No. 1, pp. 98-102 (5 pages); indexed in EI and CAS

Keywords: text categorization; fuzzy clustering; text classifier; document labeling; document retrieval; fuzzy support vector machine

Classification number: G354.4 [Cultural Science: Information Science]
