Abstract
Redundant features in high-dimensional data degrade the training efficiency and generalization ability of machine learning models. To improve pattern recognition accuracy and reduce computational complexity, an unsupervised feature selection method based on the Regularized Mutual Representation (RMR) property is proposed. First, the correlations between features are used to build a mathematical model for unsupervised feature selection constrained by the Frobenius norm. Then, a divide-and-conquer ridge regression optimization algorithm is designed to optimize the model quickly. Finally, the importance of each feature is jointly evaluated from the optimal solution of the model, and a representative feature subset is selected from the original data. In clustering accuracy, the RMR method improves on the Laplacian method by 7 percentage points, on the Nonnegative Discriminative Feature Selection (NDFS) method by 7 percentage points, on the Regularized Self-Representation (RSR) method by 6 percentage points, and on the Self-Representation Feature Selection (SR_FS) method by 3 percentage points. In data redundancy rate, the RMR method is lower than the Laplacian method by 10 percentage points, the NDFS method by 7 percentage points, the RSR method by 3 percentage points, and the SR_FS method by 2 percentage points. The experimental results show that the RMR method can effectively select important features, reduce the data redundancy rate, and improve the clustering accuracy of samples.
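The three steps named in the abstract (a mutual-representation model, one ridge regression per feature, and scoring from the solution) can be illustrated with a minimal sketch. The abstract does not state the objective function or the scoring rule, so everything below is an assumption made for illustration: each feature is reconstructed from all the other features with an L2-penalized least-squares fit, and a feature is scored by the norm of its weights across those fits. The names `rmr_feature_scores`, `select_features`, and the parameter `lam` are hypothetical and are not taken from the paper.

```python
import numpy as np


def rmr_feature_scores(X, lam=1.0):
    """Score features by how much they help reconstruct the other features.

    Assumed formulation (not confirmed by the abstract): for each feature i,
    solve  min_w ||x_i - X_{-i} w||^2 + lam * ||w||^2  and collect the weights.

    X   : array of shape (n_samples, n_features)
    lam : ridge regularization strength (hypothetical hyperparameter)
    """
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)              # center each feature
    d = X.shape[1]
    G = X.T @ X                         # Gram matrix reused by every sub-problem
    W = np.zeros((d, d))                # W[j, i]: weight of feature j when representing feature i
    for i in range(d):                  # one independent ridge regression per feature
        others = np.delete(np.arange(d), i)
        A = G[np.ix_(others, others)] + lam * np.eye(d - 1)
        b = G[others, i]
        W[others, i] = np.linalg.solve(A, b)
    # Row norm: total contribution of feature j to representing all other features.
    return np.linalg.norm(W, axis=1)


def select_features(X, k, lam=1.0):
    """Return the indices of the k highest-scoring features."""
    scores = rmr_feature_scores(X, lam)
    return np.argsort(scores)[::-1][:k]
```

Because each feature's sub-problem is independent of the others, the loop could be split across workers, which is one plausible reading of the "divide-and-conquer" description. A constant feature carries no information about the others and receives a score of zero under this sketch.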
Authors
WANG Zhiyuan (汪志远)
JIANG Ailian (降爱莲)
Osman MUHAMMAD (奥斯曼·穆罕默德)
College of Information and Computer, Taiyuan University of Technology, Jinzhong, Shanxi 030600, China
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journal (北大核心)
2020, No. 7, pp. 1896-1900 (5 pages)
Funding
Shanxi Province Research Funding Project for Returned Overseas Scholars (2017-051).
Keywords
feature selection
unsupervised learning
divide-and-conquer algorithm
ridge regression
regularization