Deep clustering, which combines deep learning with traditional clustering methods, can effectively solve the problem of clustering high-dimensional data and has received widespread attention in the field of data processing. However, the large computational cost of deep clustering models often constrains their research and even their application. To address the long training time of deep clustering models, this paper explores ways to improve training efficiency along two lines: reducing the time of a single iteration, and reducing the number of iterations needed to reach the desired accuracy. Two methods are proposed: Deep K-means based on a Random Sampling Strategy (RSDK) and Two-Stage Deep K-means based on Orthogonal Transform Features (OTDK). RSDK optimizes the deep clustering model with a random sampling strategy: by reducing the amount of data processed in each epoch, it shortens the per-epoch time, so that the total training time decreases for the same number of epochs.
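The paper's abstract does not give pseudocode for RSDK; the following is a minimal illustrative sketch of the per-epoch random subsampling idea, using a plain k-means centroid update as a stand-in for the deep clustering model's update step (the function name and all parameters are hypothetical).

```python
import numpy as np

def train_with_random_sampling(X, n_clusters, sampling_rate=0.3, epochs=5, seed=0):
    """Toy illustration of per-epoch random subsampling (RSDK-style loop).

    Each epoch, only a random fraction of the data is used to update the
    cluster centroids, so the per-epoch cost scales with sampling_rate.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialize centroids from randomly chosen data points.
    centroids = X[rng.choice(n, n_clusters, replace=False)].copy()
    for _ in range(epochs):
        # Draw this epoch's subsample; its size controls the per-epoch cost.
        idx = rng.choice(n, max(1, int(n * sampling_rate)), replace=False)
        batch = X[idx]
        # Assign each sampled point to its nearest centroid.
        dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid toward the mean of its sampled members.
        for k in range(n_clusters):
            members = batch[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids
```

Lowering `sampling_rate` shrinks `batch` and hence the work per epoch, which is the mechanism behind the reported drop in total training time at equal epoch counts.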
OTDK improves the deep clustering model from several angles, including the training strategy, loss function, and network architecture, aiming to reach the expected clustering results with fewer updates to the model parameters. The two proposed algorithms were validated on three datasets: MNIST, F-MNIST, and CIFAR-10. The results show that the training time of RSDK decreases as the sampling rate decreases, while OTDK achieves high clustering accuracy on MNIST with fewer model parameter updates. Although its clustering accuracy on the other two datasets is not yet at a leading level, it shows no significant difference from RSDK, and the model has the advantage of faster convergence.
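The abstract does not specify how OTDK's orthogonal transform features are computed, so the sketch below is only a schematic of the two-stage idea: as an assumption, an SVD-based orthogonal projection stands in for the learned transform, and a k-means refinement in the transformed space stands in for the second stage (the function name and defaults are hypothetical).

```python
import numpy as np

def two_stage_cluster(X, n_clusters, n_components=2, iters=10, seed=0):
    """Schematic two-stage scheme in the spirit of OTDK.

    Stage 1: project the data onto orthogonal transform features (here a
    simple SVD projection; the paper's learned transform differs).
    Stage 2: refine cluster assignments in the transformed space, where
    fewer updates are typically needed to stabilize.
    """
    rng = np.random.default_rng(seed)
    # Stage 1: orthogonal projection; rows of Vt are orthonormal directions.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:n_components].T
    # Stage 2: k-means refinement on the low-dimensional features.
    centroids = Z[rng.choice(len(Z), n_clusters, replace=False)].copy()
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = Z[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return labels
```

The design intent mirrored here is that the first stage produces a compact, decorrelated representation, so the second-stage assignments converge in fewer updates than clustering in the raw input space.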