Abstract
High-dimensional data is ubiquitous in the real world. However, such data usually contains a large amount of redundant and noisy information, which causes many traditional clustering algorithms to perform poorly on it. In practice, the cluster structure of high-dimensional data is often embedded in a lower-dimensional subspace, so dimension reduction has become a key technique for mining the cluster structure of high-dimensional data. Among the many dimension reduction methods, graph-based methods are a research hotspot. However, most graph-based dimension reduction algorithms suffer from two problems: (1) they need to compute or learn an adjacency graph, which has high computational complexity; (2) the intended use of the reduced data is not considered during dimension reduction. To address these two problems, a fast unsupervised dimension reduction algorithm based on maximum entropy, MEDR, is proposed. MEDR combines linear projection with the maximum entropy clustering model and uses an effective iterative optimization algorithm to find the potentially optimal cluster structure of high-dimensional data embedded in a low-dimensional subspace. MEDR does not need an adjacency graph as input, and its time complexity is linear in the number of samples. Experimental results on real datasets show that, compared with traditional dimension reduction methods, MEDR finds a better projection matrix for projecting high-dimensional data into a low-dimensional subspace, so that the projected data is well suited to clustering.
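The abstract gives no formulas, so the following is only a minimal sketch of how a linear projection and a maximum entropy clustering model could be optimized together by alternating updates. It is not the authors' MEDR implementation. The objective in the docstring, the whitening step, the PCA-style initialization, the parameter `gamma`, and the function name `medr_like_sketch` are all assumptions made for illustration.

```python
import numpy as np


def medr_like_sketch(X, n_clusters, n_dims, gamma=0.5, n_iter=30, seed=0):
    """Alternate maximum entropy soft clustering with a linear projection update.

    Assumed objective (NOT taken from the paper):
        sum_ij u_ij * ||P^T x_i - v_j||^2 + gamma * sum_ij u_ij * log(u_ij)
    with each row of U on the probability simplex and P^T S_t P = I,
    where S_t is the total covariance of X.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)

    # Whiten the data so that the constraint P^T S_t P = I becomes W^T W = I.
    _, s, Qt = np.linalg.svd(Xc, full_matrices=False)
    keep = s > 1e-10 * s[0]
    A = Qt[keep].T * (np.sqrt(n) / s[keep])   # (d, r) whitening matrix
    Z = Xc @ A                                # (n, r), with Z^T Z = n * I

    # Start from the leading principal directions and random data points.
    W = np.eye(Z.shape[1])[:, :n_dims]
    V = (Z @ W)[rng.choice(n, size=n_clusters, replace=False)]
    history = []

    for _ in range(n_iter):
        Y = Z @ W
        # Step 1: closed-form maximum entropy memberships, i.e. a softmax over
        # negative squared distances to the centers with temperature gamma.
        D = ((Y[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        logits = -D / gamma
        logits -= logits.max(axis=1, keepdims=True)
        U = np.exp(logits)
        U /= U.sum(axis=1, keepdims=True)

        # Step 2: membership-weighted centers in the whitened space.
        mass = U.sum(axis=0) + 1e-12
        M = (U.T @ Z) / mass[:, None]

        # Step 3: the projection minimizing the weighted within-cluster
        # scatter is spanned by its eigenvectors with smallest eigenvalues.
        Sw = Z.T @ Z - (M.T * mass) @ M
        _, vecs = np.linalg.eigh(Sw)          # eigenvalues in ascending order
        W = vecs[:, :n_dims]
        V = M @ W

        # Track the assumed objective after each full pass.
        Y = Z @ W
        D = ((Y[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        entropy_term = np.sum(U * np.log(np.clip(U, 1e-300, None)))
        history.append(float(np.sum(U * D) + gamma * entropy_term))

    # Map the projection back to the original feature space.
    return A @ W, U, history
```

Calling `medr_like_sketch(X, n_clusters=3, n_dims=2)` on an `(n, d)` array returns a `(d, 2)` projection matrix, an `(n, 3)` soft membership matrix, and the objective value after each pass. No adjacency graph is built. Each pass costs time linear in the number of samples for fixed dimension and cluster count, which is the property the abstract emphasizes. The initial SVD used for whitening is also linear in `n` when `d` is fixed.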
Authors
王继奎
杨正国
刘学文
易纪海
李冰
聂飞平
WANG Ji-Kui; YANG Zheng-Guo; LIU Xue-Wen; YI Ji-Hai; LI Bing; NIE Fei-Ping (College of Information Engineering, Lanzhou University of Finance and Economics, Lanzhou 730020, China; Center for Optical Imagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an 710072, China)
Source
《软件学报》
EI
CSCD
Peking University Core Journals (北大核心)
2023, Issue 4, pp. 1779-1795 (17 pages)
Journal of Software
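Funding
National Natural Science Foundation of China (61772427, 11801345)
Gansu Provincial Higher Education Innovation Capability Improvement Project (2019B-97)
Lanzhou University of Finance and Economics University-Level Key Project (Lzufe2020B-0010, Lzufe2020B-011)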
Keywords
unsupervised learning
linear dimension reduction
adjacency graph
clustering
maximum entropy