Abstract
Supervised learning has traditionally focused on single-label classification, in which each sample carries exactly one label. As data grow richer, a single label can no longer fully describe a sample, and one sample now often corresponds to several labels, so multi-label classification has gradually become an important research direction in data mining. Although multiple labels describe a sample more completely, multi-label data usually have a very large number of features. Such data are difficult to process directly and often suffer from the curse of dimensionality, so reducing the dimensionality before classification has a significant effect on the final result. This paper proposes a method that selects the feature subset using conditional mutual information (the maximum-dependency, minimum-redundancy criterion, MDMR) to remove uninformative features, and then classifies the data with an improved KNN algorithm. Experiments show that the method improves average recall by 2.5%.
Keywords
Single-Label
Multi-Label
Conditional Mutual Information
Feature Extraction
KNN Algorithm