Heterogeneity Analysis of High-Dimensional Data with Mixed Types of Covariates
Abstract In the era of big data, high-dimensional survey data with mixed types of covariates pose challenges for heterogeneity analysis and the associated variable selection. This paper proposes an improved sparse clustering method and discusses its application using the China Education Panel Survey and the "Thousands of People and Hundreds of Villages" social survey as examples. The paper introduces an adjusted DBI criterion to measure the importance of each covariate and uses separate penalty parameters to control the weights of the different covariate types, yielding both the optimal clustering partition and the covariates that are important for distinguishing the clusters. At the theoretical level, the paper proves the variable screening consistency of the proposed method. At the numerical level, a series of simulation experiments verifies the good performance of the method in both clustering and variable selection. The empirical results also show that the sample clusters produced by the proposed method are well separated, which makes it easier for researchers to characterize each group. At the same time, the selected cluster-distinguishing variables are of practical significance: the dimensionality of the data is reduced without losing important information, and the interpretability of the model is increased. The proposed sparse clustering analysis enables the joint analysis of mixed types of covariates in high-dimensional survey data, which greatly improves the utilization of the information and of the data.
Authors XU Shaodong (徐少东); LI Yang (李扬); BIAN Ce (边策) (Center for Applied Statistics, Renmin University of China, Beijing 100872; School of Statistics, Renmin University of China, Beijing 100872; School of Science, Renmin University of China, Beijing 100872)
Source Journal of Systems Science and Mathematical Sciences (《系统科学与数学》), CSCD, Peking University Core Journal, 2024, No. 8, pp. 2429-2457 (29 pages)
Funding Supported by the Research Funds of Renmin University of China (Fundamental Research Funds for the Central Universities), project 18XNE018
Keywords Heterogeneity analysis; mixed data; high-dimensional data; variable selection
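The abstract describes scoring covariates with an adjusted DBI criterion but does not spell out the adjustment. As a rough illustration only, the sketch below assumes that DBI refers to the Davies-Bouldin index and computes the standard, unadjusted index one covariate at a time for a fixed clustering. The function names `davies_bouldin` and `per_covariate_dbi`, the toy data, and the idea of one-hot encoding a categorical covariate before scoring it are all assumptions made for this example, not the authors' procedure.

```python
import numpy as np


def davies_bouldin(X, labels):
    """Standard Davies-Bouldin index; lower values mean better-separated clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ids])
    # Within-cluster scatter: mean distance of a cluster's points to its centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == k] - c, axis=1).mean()
        for k, c in zip(ids, centroids)
    ])
    worst = []
    for i in range(len(ids)):
        # Similarity of cluster i to every other cluster; keep the worst case.
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(ids)) if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))


def per_covariate_dbi(columns, labels):
    """Score each covariate separately (hypothetical helper).

    `columns` is a list of 2-D arrays, one per covariate: a continuous
    covariate has shape (n, 1); a categorical one is assumed to be
    one-hot encoded with shape (n, n_levels).
    """
    return [davies_bouldin(col, labels) for col in columns]


# Toy data (made up): two clusters of three observations each.
labels = np.array([0, 0, 0, 1, 1, 1])
informative = np.array([0.0, 0.1, -0.1, 5.0, 5.1, 4.9]).reshape(-1, 1)
noise = np.array([0.0, 2.0, 4.0, 1.0, 3.0, 5.0]).reshape(-1, 1)
# One-hot encoded categorical covariate that matches the clusters imperfectly.
categorical = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [1, 0]], dtype=float)

scores = per_covariate_dbi([informative, noise, categorical], labels)
```

In this toy setting the informative covariate receives a much smaller index than the noise covariate, which is the kind of ranking a DBI-based importance measure relies on. How the paper adjusts the index and how its penalty parameters turn such scores into weights for each covariate type is not stated in the abstract and is not reproduced here.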