Abstract
Memory constraints often make regression analysis on large-scale data impractical. To address this, we borrow the divide-and-conquer idea and propose a new regression method: block SCAD penalized regression. The method splits a large dataset into several blocks, fits a SCAD penalized regression on each block, and takes the simple average of the block-level parameter estimates as an approximation to the full-sample regression coefficient estimates. We further establish the variable selection performance and asymptotic properties of the method theoretically. Numerical simulations and a real-world application show that block SCAD penalized regression significantly reduces both memory requirements and computation time, while its variable selection, parameter estimation, and prediction results are largely consistent with those of full-sample regression.
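The three steps described in the abstract (split the data into blocks, fit a SCAD penalized regression on each block, average the block estimates) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names `scad_threshold`, `scad_fit`, and `block_scad`, the coordinate-descent solver, the fixed tuning value `lam`, the absence of an intercept, and the simulated data are all hypothetical choices made here. The shape parameter `a = 3.7` is the value commonly suggested for SCAD.

```python
import numpy as np


def scad_threshold(z, lam, a=3.7):
    """Closed-form SCAD solution for one coefficient of a unit-scale predictor."""
    az = abs(z)
    if az <= 2 * lam:
        # Small signals: behaves like the lasso (soft-thresholding).
        return np.sign(z) * max(az - lam, 0.0)
    if az <= a * lam:
        # Intermediate signals: shrinkage is gradually relaxed.
        return ((a - 1) * z - np.sign(z) * a * lam) / (a - 2)
    # Large signals: no shrinkage at all.
    return z


def scad_fit(X, y, lam, a=3.7, n_iter=200, tol=1e-8):
    """SCAD penalized least squares via coordinate descent (assumed solver).

    Assumes no intercept and no all-zero column in X.
    """
    n, p = X.shape
    scale = np.sqrt((X ** 2).mean(axis=0))  # put every column on unit scale
    Xs = X / scale
    beta = np.zeros(p)
    resid = y.astype(float).copy()          # residual for beta = 0
    for _ in range(n_iter):
        max_step = 0.0
        for j in range(p):
            # Unpenalized univariate solution for coordinate j.
            z = Xs[:, j] @ resid / n + beta[j]
            new = scad_threshold(z, lam, a)
            if new != beta[j]:
                resid -= Xs[:, j] * (new - beta[j])
                max_step = max(max_step, abs(new - beta[j]))
                beta[j] = new
        if max_step < tol:
            break
    return beta / scale                     # back to the original column scale


def block_scad(X, y, n_blocks, lam, a=3.7):
    """Fit SCAD on each row block and return the simple average of the estimates."""
    blocks = np.array_split(np.arange(X.shape[0]), n_blocks)
    estimates = [scad_fit(X[idx], y[idx], lam, a) for idx in blocks]
    return np.mean(estimates, axis=0)


# Hypothetical simulated data: 8 predictors, 3 of them truly non-zero.
rng = np.random.default_rng(0)
n, p = 6000, 8
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)

beta_block = block_scad(X, y, n_blocks=6, lam=0.2)  # block-wise estimate
beta_full = scad_fit(X, y, lam=0.2)                 # full-sample estimate
print(np.round(beta_block, 3))
print(np.round(beta_full, 3))
```

Only one block has to be held in memory at a time, which is where the memory saving would come from; in a real setting each block would be read from disk rather than sliced from an in-memory array. Note that a plain average leaves a coefficient exactly zero only if every block sets it to zero.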
Authors
CAI Chao; XU Qi-fa; JIANG Cui-xia; WANG Yan-ming (School of Statistics, Shandong Technology and Business University, Shandong Yantai 264005, China; School of Management, Hefei University of Technology, Anhui Hefei 230009, China)
Source
Journal of Applied Statistics and Management (《数理统计与管理》)
CSSCI
Peking University Core Journal
2018, Issue 6, pp. 1023-1040 (18 pages)
Funding
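National Natural Science Foundation of China (71671056)
National Social Science Fund of China (14BTJ028, 15BJY008)
Humanities and Social Sciences Research Planning Fund of the Ministry of Education (14YJA790015)
Shandong Provincial Social Science Planning Project (18DTJJ01)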