Abstract
In the era of big data, collected data are increasingly high-dimensional and come from a growing number of sources. For multi-source high-dimensional data, this paper proposes an integrative analysis method that accounts for the network structure among data sources. The network is constructed with the k-nearest-neighbor method. A Network MCP penalty is imposed on the model coefficients of datasets that are connected in the network, which automatically identifies homogeneous and heterogeneous datasets, and an MCP penalty selects the important variables of each dataset. The method thus performs model estimation, variable selection, and subgroup identification of data sources simultaneously. Simulation experiments show that, under different settings, the proposed method performs well in variable selection, parameter estimation, and classification prediction accuracy. Finally, the method is applied to real estate rental evaluation data, with the network among data sources constructed from latitude and longitude information; it identifies real estate submarkets well and yields a more interpretable model.
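The two ingredients named in the abstract, a k-nearest-neighbor network over data sources and MCP-type penalties, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the exact form of the Network MCP penalty, so the sketch assumes it is the MCP function applied to the Euclidean norm of the coefficient difference between connected datasets. The function names (`mcp`, `knn_edges`, `penalty`), the tuning parameters `lam1`, `lam2`, `gamma`, and the use of plain Euclidean distance on coordinates are all illustrative assumptions.

```python
import numpy as np

def mcp(t, lam, gamma):
    """Minimax concave penalty (MCP), evaluated elementwise at |t|."""
    a = np.abs(t)
    return np.where(a <= gamma * lam,
                    lam * a - a**2 / (2 * gamma),  # concave region
                    gamma * lam**2 / 2)            # flat region: no further shrinkage

def knn_edges(coords, k):
    """Undirected k-nearest-neighbor edges between data sources.

    coords: (M, d) array, one row per data source (e.g. latitude/longitude).
    Assumption: plain Euclidean distance is an adequate proximity measure.
    """
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a source is not its own neighbor
    nbrs = np.argsort(d, axis=1)[:, :k]    # indices of the k closest sources
    return sorted({(min(i, int(j)), max(i, int(j)))
                   for i in range(len(coords)) for j in nbrs[i]})

def penalty(betas, edges, lam1, lam2, gamma):
    """Sparsity MCP on every coefficient plus an assumed network fusion term.

    betas: (M, p) array, one coefficient vector per data source.
    """
    sparsity = mcp(betas, lam1, gamma).sum()
    # Assumed form: MCP on the norm of coefficient differences along each edge,
    # so connected sources with identical coefficients incur no fusion cost.
    fusion = sum(mcp(np.linalg.norm(betas[i] - betas[j]), lam2, gamma)
                 for i, j in edges)
    return float(sparsity + fusion)
```

In a full estimator this penalty would be added to each dataset's loss and minimized jointly; data sources whose estimated coefficient vectors coincide would then be read as one subgroup.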
Authors
Fang Kuangnan (方匡南)
Zhang Qingwen (张晴雯)
Lin Hongwei (林洪伟)
Source
Statistical Research (《统计研究》)
CSSCI
Peking University Core Journals (北大核心)
2022, No. 7, pp. 125-136 (12 pages)
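Funding
National Natural Science Foundation of China, General Program: "High-Dimensional Classification Methods Based on Multi-Source Information Fusion and Their Application in Credit Scoring" (72071169)
Ministry of Education Humanities and Social Sciences Youth Fund: "Research on Consumer Finance Risk Control Methods and Applications Based on Semi-Supervised Learning" (20YJC910004)
National Social Science Fund of China, Major Project: "Research on the Measurement Theory, Methods, and Progress Evaluation of the Modernization of National Governance Capacity" (21&ZD146)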
Keywords
Multi-source High-dimensional Data
Integrative Analysis
Network Structure
Subgroup Identification