An Improved Feature Selection Algorithm Based on Mutual Information

Cited by: 2

Abstract: Feature selection is a central research topic in automatic Chinese text categorization. Its purpose is to resolve the tension between the high dimensionality of the feature space and the sparsity of document representation vectors. Because the mutual information (MI) feature selection method yields comparatively poor classification results, an improved mutual information feature selection method, IMI, is proposed. The method takes into account how frequently a feature term occurs in the current text, and it also handles feature selection when the mutual information value is negative, so that low-frequency words are filtered out more effectively. Experiments with a KNN classifier show that the improved method markedly increases classification accuracy.
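The abstract does not state the IMI formula, so it is not reproduced here. As background only, the following is a minimal sketch of the standard document-count MI score commonly used for text feature selection, not the authors' improved method. The function name `mi_score`, the count names `a`, `b`, `c`, `n`, and the toy numbers are all illustrative assumptions. It shows the two behaviors the abstract refers to: a term seen only once can outscore a frequent, informative term, and the score becomes negative when a term occurs mostly outside the category.

```python
import math


def mi_score(a: int, b: int, c: int, n: int) -> float:
    """Standard MI score between a term and a category (illustrative sketch).

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    n: total number of documents
    """
    if a == 0:
        # The term never co-occurs with the category; log(0) is undefined.
        return float("-inf")
    return math.log((a * n) / ((a + c) * (a + b)))


# Hypothetical corpus: 1000 documents, 100 of them in the category.
rare = mi_score(a=1, b=0, c=99, n=1000)        # term seen once, inside the category
common = mi_score(a=60, b=40, c=40, n=1000)    # frequent, fairly indicative term
negative = mi_score(a=2, b=198, c=98, n=1000)  # term found mostly outside the category

print(rare, common, negative)
```

In this toy setting the rare term scores higher than the common one, which is the low-frequency bias that a frequency-aware variant would aim to reduce.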

Authors: KANG Lan-lan, DONG Dan-dan (Faculty of Applied Science, Jiangxi University of Science and Technology, Ganzhou 341000, China)

Affiliation: Faculty of Applied Science, Jiangxi University of Science and Technology

Source: Computer Knowledge and Technology (《电脑知识与技术》), 2009, No. 12Z, pp. 9889-9890 (2 pages)

Keywords: automatic Chinese text categorization; feature selection; mutual information

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]
