Two-step feature selection method for Chinese text categorization

Cited by: 2

Abstract: Traditional feature selection methods evaluate feature importance from the perspective of the whole corpus, so they may ignore features that matter for categorization but do not score highly corpus-wide. To address this problem, a two-step feature selection method is proposed. First, features that are not strongly associated with the categories are filtered out. Second, words are assigned to individual categories as category discriminating words according to their statistical information, and an optimal subset of categorization features is found for each category. Finally, the optimal subsets of all categories are combined to form the final set of categorization features. In the experiments, a naive Bayesian classifier is used, and the proposed method is compared with the traditional method under five feature selection formulas: IG, ECE, CC, MI and CHI. The Macro-F1 scores of the proposed method versus the traditional method are 91.075% vs. 86.971% (IG), 91.122% vs. 86.992% (ECE), 91.160% vs. 87.470% (CC), 90.253% vs. 86.061% (MI) and 90.881% vs. 87.006% (CHI). While taking category-level feature information into account, the method retains as many of the good features chosen by traditional feature selection as possible and captures categorization information better.
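The abstract gives only the outline of the procedure, so the following is a minimal sketch of one way to read it, not the authors' implementation. It assumes CHI as the scoring formula, a fixed score threshold for the filtering step, assignment of each word to the single category where it scores highest, and a simple top-k cut as a stand-in for the "optimal subset" search. The names `two_step_select`, `min_score` and `per_class` are hypothetical, and the toy corpus is invented for illustration.

```python
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """CHI score of a term for a class from a 2x2 table of document counts.

    a: docs in the class containing the term, b: docs outside it containing the term,
    c: docs in the class without the term,    d: docs outside it without the term.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def two_step_select(docs, labels, min_score=1.0, per_class=2):
    """docs: list of token lists; labels: parallel list of class names."""
    n_docs = len(docs)
    class_size = Counter(labels)
    df = Counter()                      # document frequency per term
    df_in_class = defaultdict(Counter)  # class -> term -> document frequency
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_in_class[label][term] += 1

    by_class = defaultdict(list)
    for term in df:
        scores = {}
        for cls, size in class_size.items():
            a = df_in_class[cls][term]
            b = df[term] - a
            c = size - a
            d = n_docs - size - b
            # Assumption: only positive association counts toward a class.
            scores[cls] = chi_square(a, b, c, d) if a * d > b * c else 0.0
        best_cls = max(scores, key=scores.get)
        # Step 1 (assumed threshold): drop terms weakly associated with every class.
        if scores[best_cls] >= min_score:
            # Step 2a (assumed rule): the term becomes a discriminating word
            # of the class where it scores highest.
            by_class[best_cls].append((scores[best_cls], term))

    # Step 2b (top-k stand-in for the optimal subset), then union over classes.
    selected = set()
    for scored in by_class.values():
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        selected.update(term for _, term in scored[:per_class])
    return selected


docs = [
    ["football", "match", "the"],
    ["football", "player", "the"],
    ["stock", "market", "the"],
    ["stock", "fund", "the"],
]
labels = ["sports", "sports", "finance", "finance"]
print(sorted(two_step_select(docs, labels)))
```

Selecting per category and then taking the union is what lets a word that is decisive for one small category survive even when its corpus-wide score is unremarkable, which is the motivation stated in the abstract.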

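Authors: CHEN Ji (陈集), FAN Xinghua (樊兴华), WANG Peng (王鹏)

Affiliation: Institute of Computer Science and Technology, Chongqing University of Posts and Telecommunications

Source: Computer Aided Engineering (《计算机辅助工程》), 2008, No. 3, pp. 76-80 (5 pages)

Funding: National Natural Science Foundation of China (60703010); Natural Science Foundation of Chongqing (2006BB2374); Science and Technology Research Project of the Chongqing Municipal Education Commission (KJ070519); Scientific Research Foundation for Returned Overseas Chinese Scholars, Ministry of Education (教外司留[2007]1109号)

Keywords: two-step feature selection; Chinese text categorization; category discriminating word; naive Bayesian

Classification codes: TP391.1 [Automation and Computer Technology: Computer Application Technology]; TP181 [Automation and Computer Technology: Control Theory and Control Engineering]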
