Feature Selection Method Combining Optimized Document Frequency with Variable Precision Rough Sets (结合优化文档频和变精度粗糙集的特征选择方法)

Cited by: 1

Abstract: In text categorization, the feature space commonly has tens of thousands of dimensions and often far exceeds the number of available training samples. To speed up text mining algorithms, reduce their memory footprint, and filter out irrelevant or weakly relevant features, a feature selection algorithm must be used. This paper first presents a document frequency method based on minimum word frequency. It then introduces variable precision rough sets and proposes an attribute reduction algorithm based on information entropy. Finally, it combines the two into a comprehensive feature selection algorithm, which first selects features with the document frequency method based on minimum word frequency and then applies the proposed attribute reduction algorithm to eliminate redundancy, yielding a more representative feature subset. Experimental results show that the comprehensive algorithm outperforms three of the best classical feature selection methods: mutual information, the chi-square statistic, and document frequency.
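The abstract does not spell out the document frequency method based on minimum word frequency. The sketch below shows one plausible reading, under stated assumptions: a document counts toward a term's document frequency only if the term occurs in it at least `min_tf` times, and a term is kept when that count reaches `min_df`. The function names, both thresholds, and the toy corpus are hypothetical and are not taken from the paper.

```python
from collections import Counter


def min_tf_document_frequency(docs, min_tf=2):
    """For each term, count the documents where it occurs at least min_tf times.

    Assumption: this thresholded count is what "document frequency based on
    minimum word frequency" means; the paper's exact definition may differ.
    """
    df = Counter()
    for tokens in docs:
        term_counts = Counter(tokens)  # within-document term frequency
        for term, count in term_counts.items():
            if count >= min_tf:
                df[term] += 1
    return df


def select_features(docs, min_tf=2, min_df=2):
    """Keep terms whose thresholded document frequency is at least min_df."""
    df = min_tf_document_frequency(docs, min_tf)
    return sorted(term for term, n in df.items() if n >= min_df)


# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    ["rough", "set", "rough", "text", "the"],
    ["rough", "rough", "entropy", "text", "text"],
    ["text", "text", "set", "the"],
]
selected = select_features(docs, min_tf=2, min_df=2)
print(selected)
```

With plain document frequency (`min_tf=1`), the incidental term "the" would reach a count of 2 and pass the same `min_df`. Requiring repeated occurrences within a document filters it out. The redundancy-elimination step based on variable precision rough sets is not sketched here, because the abstract gives no detail on it.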

Authors: Zhu Haodong (朱颢东), Zhong Yong (钟勇)

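Affiliations: Chengdu Institute of Computer Applications, Chinese Academy of Sciences; Graduate University of the Chinese Academy of Sciences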

Source: Journal of Henan University (Natural Science) (《河南大学学报（自然科学版）》), 2009, No. 5, pp. 515-520 (6 pages). Indexed in: CAS, Peking University Core Journals.

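Funding: Sichuan Province Science and Technology Plan Project (2008GZ0003); Key Science and Technology Project of the Science and Technology Department of Sichuan Province (07GG006-014)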

Keywords: feature selection; minimum word frequency; document frequency; variable precision rough set; information entropy; attribute reduction

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]

