Research on the Algorithm of Feature Selection Based on Gini Index for Text Categorization

Original title: 文本分类中基于基尼指数的特征选择算法研究. Cited by: 38


Abstract: With the rapid growth of the World Wide Web, large numbers of documents have become available online, and automatic text categorization has become a key technology for organizing and processing them. For most classifiers that use the vector space model (VSM), text preprocessing is the bottleneck of categorization, because many classifiers cannot cope with the high dimensionality of the raw feature space. Reducing that dimensionality with an appropriate feature selection algorithm is therefore a primary task in text categorization. Many text feature selection algorithms already exist; rather than surveying them in detail, this paper presents a new method based on the Gini index. It uses an improved Gini index for text feature selection and constructs a Gini-index-based evaluation function suited to that task. Experimental results show that Gini-index-based feature selection further improves categorization performance and has low computational complexity.
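The abstract does not state the evaluation function itself, so the following is only a minimal sketch of the general idea, not the paper's improved measure. It assumes the plain Gini purity form, score(t) = sum over classes c of P(c | t)^2, estimated from document frequencies; the function names `gini_purity_scores` and `select_top_k` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def gini_purity_scores(docs, labels):
    """Score each term by sum_c P(c | term)^2, estimated from document frequencies.

    ASSUMPTION: this is the textbook Gini purity, not the improved
    evaluation function constructed in the paper.
    """
    # df[term][label] = number of documents with that label containing the term
    df = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            df[term][label] += 1

    scores = {}
    for term, per_class in df.items():
        total = sum(per_class.values())
        scores[term] = sum((n / total) ** 2 for n in per_class.values())
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus: tokenized documents and their class labels.
docs = [
    ["goal", "match", "team"],
    ["team", "election"],
    ["vote", "election"],
    ["goal", "team"],
]
labels = ["sport", "politics", "politics", "sport"]

scores = gini_purity_scores(docs, labels)
top_terms = select_top_k(scores, 2)
```

A term that occurs in only one class scores 1.0, while a term spread evenly over all classes scores the minimum. This raw purity also gives the top score to any term that appears in a single document, which is one reason a refined evaluation function may be preferable in practice.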

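Authors: Shang Wenqian (尚文倩), Huang Houkuan (黄厚宽), Liu Yuling (刘玉玲), Lin Yongmin (林永民), Qu Youli (瞿有利), Dong Hongbin (董红斌)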

Affiliation: School of Computer and Information Technology, Beijing Jiaotong University

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2006, No. 10, pp. 1688-1694 (7 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.

Funding: National Natural Science Foundation of China (60503017); Beijing Jiaotong University Talent Fund (JSJ04002)

Keywords: text categorization; text feature selection; Gini index; text preprocessing

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

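Citation network: 25 references; 17 secondary references; 217 co-citing documents; 427 co-cited documents; 38 citing documents; 325 secondary citing documents.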






