
High-dimensional Data Integrative Analysis and Subgroup Identification Incorporating Data Source Network Structure (考虑数据源网络结构的高维数据整合分析与子群识别研究)
Cited by: 1
Abstract In the era of big data, collected data are increasingly high-dimensional and come from a growing number of sources. For multi-source high-dimensional data, this paper proposes an integrative analysis method that accounts for the network structure among data sources. The k-nearest neighbor method is used to construct the network between data sources. A Network MCP penalty is applied to the model coefficients of datasets that are connected in the network, which automatically identifies homogeneous and heterogeneous datasets, and a separate MCP penalty selects the important variables of each dataset. The method therefore performs model estimation, variable selection, and subgroup identification of data sources simultaneously. Simulation experiments show that the proposed method performs well in variable selection, parameter estimation, and classification prediction accuracy under different simulation settings. Finally, the method is applied to real estate rental evaluation data, with the network between data sources built from latitude and longitude information; it identifies real estate submarkets well and yields a more interpretable model.
Authors Fang Kuangnan (方匡南); Zhang Qingwen (张晴雯); Lin Hongwei (林洪伟)
Source Statistical Research (《统计研究》), CSSCI, Peking University Core Journal, 2022, No. 7, pp. 125-136 (12 pages)
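Funding National Natural Science Foundation of China, General Program, "High-dimensional classification methods based on multi-source information fusion and their application in credit scoring" (72071169); Ministry of Education Humanities and Social Sciences Research Youth Fund, "Research on consumer finance risk control methods and applications based on semi-supervised learning" (20YJC910004); National Social Science Fund of China, Major Project, "Research on the measurement theory, methods, and progress evaluation of the modernization of national governance capacity" (21&ZD146).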
Keywords Multi-source High-dimensional Data; Integrative Analysis; Network Structure; Subgroup Identification
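
The two ingredients named in the abstract, a k-nearest-neighbor network between data sources and MCP-type penalties on the coefficients, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the objective function, so the fusion term (MCP applied to the Euclidean norm of the coefficient difference of two connected sources), the use of plain Euclidean distance on coordinates, and the names `knn_edges`, `mcp`, and `total_penalty` are all hypothetical. Only the MCP formula itself is the standard definition.

```python
import numpy as np

def knn_edges(coords, k):
    """Undirected edges (i, j), i < j, linking each source to its k nearest sources.

    Assumption: plain Euclidean distance on the coordinates; for real
    latitude/longitude data a geodesic distance may be more appropriate.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a source is not its own neighbour
    edges = set()
    for i in range(len(coords)):
        for j in np.argsort(dist[i])[:k]:
            edges.add((min(i, int(j)), max(i, int(j))))
    return edges

def mcp(t, lam, gamma):
    """Minimax concave penalty, evaluated elementwise at |t|."""
    a = np.abs(t)
    return np.where(a <= gamma * lam,
                    lam * a - a ** 2 / (2 * gamma),
                    gamma * lam ** 2 / 2)

def total_penalty(B, edges, lam1, lam2, gamma=3.0):
    """Sparsity penalty plus network fusion penalty (assumed form).

    B holds one row of coefficients per data source.
    """
    B = np.asarray(B, dtype=float)
    # MCP on every coefficient: selects variables within each source.
    sparsity = mcp(B, lam1, gamma).sum()
    # MCP on coefficient differences of connected sources: pulls
    # homogeneous neighbours toward a common coefficient vector.
    fusion = sum(float(mcp(np.linalg.norm(B[i] - B[j]), lam2, gamma))
                 for i, j in edges)
    return float(sparsity + fusion)

# Four hypothetical sources forming two spatial clusters.
edges = knn_edges([(0, 0), (0, 1), (5, 5), (5, 6)], k=1)  # edges (0, 1) and (2, 3)
B = [[1, 0], [1, 0], [0, 2], [0, 2]]
print(total_penalty(B, edges, lam1=0.5, lam2=0.5))
```

In this toy example the connected sources share identical coefficient vectors, so the fusion term is zero and only the sparsity term contributes; sources whose coefficients differ across an edge would incur a bounded cost, which is what allows heterogeneous neighbours to remain in separate subgroups.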
