Class information feature selection method for text classification (文本分类中的类别信息特征选择方法)

Cited by: 5

Abstract: With the rapid growth of electronic documents on the web, text classification is becoming increasingly important in information retrieval. As the feature dimensionality grows, the statistical characteristics of the samples become harder to estimate, which reduces the classifier's generalization ability and leads to overfitting. Feature selection and extraction are therefore a necessary prerequisite for text classification. This paper proposes a feature selection method based on class information: it retains as much document information as possible while also taking the class information of the training documents into account. Experiments show that the method classifies well and, in particular, improves substantially on OCFS and the chi-square statistic (CHI) on the micro-averaged measure.
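The CHI baseline named in the abstract is the chi-square statistic widely used for term selection. The sketch below illustrates that baseline only, not the proposed class-information method, whose formula is not given in this record. It assumes the standard 2×2 contingency formulation, scores each term by its maximum over classes, and uses hypothetical function names and a toy input format (documents as token lists).

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Return {term: max over classes of chi-square(term, class)}.

    docs   -- list of token lists (hypothetical toy input format)
    labels -- list of class labels, one per document
    """
    n = len(docs)
    class_sizes = Counter(labels)
    doc_sets = [set(d) for d in docs]
    # Document frequency of each term over the whole collection.
    df = Counter(t for s in doc_sets for t in s)
    # Document frequency of each term inside each class.
    df_c = Counter((t, c) for s, c in zip(doc_sets, labels) for t in s)

    scores = {}
    for term, term_df in df.items():
        best = 0.0
        for cls, size in class_sizes.items():
            a = df_c[(term, cls)]  # in class, contains term
            b = term_df - a        # other classes, contains term
            c = size - a           # in class, lacks term
            d = n - size - b       # other classes, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - c * b) ** 2 / denom)
        scores[term] = best
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


if __name__ == "__main__":
    docs = [["ball", "game"], ["ball", "team"],
            ["vote", "law"], ["vote", "game"]]
    labels = ["sport", "sport", "politics", "politics"]
    print(select_top_k(chi_square_scores(docs, labels), 2))
```

In this toy collection, "game" occurs equally often in both classes and scores zero, while terms confined to one class score highest. Taking the maximum over classes is one common way to reduce per-class scores to a single value per term; averaging them is another.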

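Authors: 余俊英, 王明文, 盛俊

Affiliation: School of Computer Information Engineering, Jiangxi Normal University (江西师范大学计算机信息工程学院)

Source: Journal of Shandong University (Natural Science) (《山东大学学报（理学版）》); indexed in CAS, CSCD, and the Peking University Core Journals list; 2006, No. 3, pp. 10-13, 59 (5 pages)

Funding: Key Science and Technology Project of the Ministry of Education (03070); Natural Science Foundation of Jiangxi Province (0311041)

Keywords: feature selection; text classification; between-class distribution; within-class distribution

Classification number: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]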
