Distributed Variable Selection: MCP Regularization (Cited by: 3)
Abstract: With the development of the digital age, massive high-dimensional data are collected in various disciplines and fields. Transforming the huge amount of collected data into a form that can be stored and analyzed, and that can inform the solution of practical problems, is a great challenge. In view of the current state of data storage, distributed storage has emerged: data are partitioned, without repetition, across different machines, thereby solving the storage problem. How to design machine learning algorithms suited to distributed data storage then becomes another problem to be solved. The formulation and development of regularization methods provide an effective tool for processing and analyzing massive high-dimensional data, but these methods are suitable only for single-machine processing. Given the advantages of non-convex regularization for variable selection and feature extraction, we combine distributed storage with non-convex regularization, focusing on non-convex regularization methods based on distributed computing to address the storage and analysis of massive high-dimensional data. This paper studies the variable selection problem under distributed data storage. We store the data separately on multiple computers that can communicate with each other and propose a distributed MCP method: based on the ADMM algorithm, the distributed MCP algorithm exchanges information between adjacent computers, completes variable selection over the full data, and is accompanied by a convergence analysis. The variable selection result of the distributed method is the same as that of the non-distributed method. Finally, experiments show that the proposed method is suitable for processing distributed storage data.
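The abstract describes an ADMM-based distributed MCP procedure in which each machine holds a block of the data and the machines agree on a common coefficient vector. The paper itself is not reproduced here, so the following is only a minimal sketch under standard assumptions: a consensus-ADMM formulation of MCP-penalized least squares, where each machine performs a local ridge-type update and the coordinator applies the MCP proximal operator (firm thresholding) to the averaged iterate. All function names, parameter values, and the stopping rule (a fixed iteration count) are illustrative choices, not the authors' exact algorithm.

```python
import numpy as np

def mcp_prox(v, lam, gamma, t):
    """Elementwise proximal operator of t * MCP(.; lam, gamma) (firm thresholding).

    Requires gamma > t. For |v| <= t*lam the output is 0; for |v| <= gamma*lam it is
    a rescaled soft-threshold; beyond gamma*lam the input passes through unbiased.
    """
    a = np.abs(v)
    return np.where(a <= t * lam, 0.0,
           np.where(a <= gamma * lam,
                    np.sign(v) * (a - t * lam) / (1.0 - t / gamma),
                    v))

def distributed_mcp_admm(Xs, ys, lam=0.5, gamma=3.0, rho=1.0, n_iter=200):
    """Consensus ADMM for sum_k 0.5*||y_k - X_k b||^2 + MCP(b; lam, gamma).

    Xs, ys: lists of local design matrices and responses, one pair per machine.
    Returns the consensus estimate z (the selected sparse coefficient vector).
    """
    K = len(Xs)
    p = Xs[0].shape[1]
    betas = [np.zeros(p) for _ in range(K)]   # local primal variables
    us = [np.zeros(p) for _ in range(K)]      # scaled dual variables
    z = np.zeros(p)                           # consensus variable
    # Pre-factor each local ridge system (X_k^T X_k + rho I).
    facs = [np.linalg.inv(X.T @ X + rho * np.eye(p)) for X in Xs]
    rhs0 = [X.T @ y for X, y in zip(Xs, ys)]
    for _ in range(n_iter):
        # Local updates: ridge regression toward the current consensus.
        for k in range(K):
            betas[k] = facs[k] @ (rhs0[k] + rho * (z - us[k]))
        # Consensus update: MCP prox of the average, with step 1/(K*rho).
        vbar = np.mean([b + u for b, u in zip(betas, us)], axis=0)
        z = mcp_prox(vbar, lam, gamma, t=1.0 / (K * rho))
        # Dual updates.
        for k in range(K):
            us[k] += betas[k] - z
    return z

# Illustrative run on synthetic data split across K = 3 "machines".
rng = np.random.default_rng(0)
p = 10
beta_true = np.zeros(p)
beta_true[0], beta_true[3] = 3.0, -2.0
Xs, ys = [], []
for _ in range(3):
    X = rng.standard_normal((100, p))
    Xs.append(X)
    ys.append(X @ beta_true + 0.1 * rng.standard_normal(100))
z = distributed_mcp_admm(Xs, ys)
```

In this sketch the estimated support of `z` matches the true nonzero coordinates, and because MCP leaves large coefficients unpenalized (unlike the Lasso), the nonzero estimates are nearly unbiased, which is the property the abstract's comparison with the non-distributed method relies on.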
Authors: WANG Ge-hua (王格华), WANG Pu-yu (王璞玉), ZHANG Hai (张海); School of Mathematics, Northwest University, Xi'an 710069
Source: Chinese Journal of Engineering Mathematics (工程数学学报), CSCD, Peking University Core, 2021, No. 3, pp. 301-314 (14 pages)
Funding: National Natural Science Foundation of China (11571011)
Keywords: distributed; sparse; MCP; ADMM
