Abstract
Massive credit data, with large numbers of samples and high dimensionality, make missing-data imputation computationally expensive. To reduce this cost, this paper proposes a compress-and-proximity imputation method that proceeds in three stages. The first stage computes inclusion probabilities from the missing proportions of samples and variables and compresses the data through unequal-probability sampling. The second stage uses between-sample distances to select the samples nearest to each incomplete sample as the training set. The third stage fits a random forest model and imputes the missing values iteratively. Simulation studies on the Australian credit scoring dataset and on credit scoring datasets from Chinese banks show that the method substantially reduces the computational load while maintaining a reasonable level of imputation accuracy.
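The first two stages can be illustrated with a minimal NumPy sketch. The abstract does not give the inclusion-probability formula or the distance measure, so both are assumptions here: the weight favours samples with more observed values, with extra credit for observing variables that are often missing, and the distance is a root-mean-square difference over the coordinates two samples both observe. All function names (`inclusion_weights`, `compress`, `masked_distance`, `nearest_training_set`) are hypothetical, and the data are synthetic. The third stage, iterative random forest imputation on the selected training set, is not shown.

```python
import numpy as np

def inclusion_weights(X, eps=1e-6):
    """Sampling weight per row. ASSUMPTION: not the paper's formula."""
    observed = ~np.isnan(X)
    col_missing = 1.0 - observed.mean(axis=0)  # missing proportion per variable
    # Rows with more observed values get more weight; an observed value in a
    # frequently missing variable counts for more. eps keeps weights positive.
    return (observed * (1.0 + col_missing)).mean(axis=1) + eps

def compress(X, m, rng):
    """Stage 1: draw m row indices without replacement, weighted by w."""
    w = inclusion_weights(X)
    # Sequential weighted draws; a simple stand-in for a formal
    # unequal-probability sampling design.
    return rng.choice(len(X), size=m, replace=False, p=w / w.sum())

def masked_distance(target, candidates):
    """RMS difference over coordinates observed in both rows (assumed metric)."""
    shared = ~np.isnan(target) & ~np.isnan(candidates)
    diff = np.where(shared, candidates - target, 0.0)
    counts = shared.sum(axis=1)
    d = np.full(len(candidates), np.inf)  # no shared coordinates -> infinitely far
    ok = counts > 0
    d[ok] = np.sqrt((diff[ok] ** 2).sum(axis=1) / counts[ok])
    return d

def nearest_training_set(X, target_idx, kept, k):
    """Stage 2: the k compressed samples closest to the incomplete target row."""
    pool = kept[kept != target_idx]  # never use the target as its own neighbour
    d = masked_distance(X[target_idx], X[pool])
    order = np.argsort(d)[:k]
    return pool[order], d[order]

# Synthetic data: 200 samples, 5 variables, about 20% of entries missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.2] = np.nan

kept = compress(X, m=50, rng=rng)
target_idx = int(np.flatnonzero(np.isnan(X).any(axis=1))[0])
neighbors, dists = nearest_training_set(X, target_idx, kept, k=10)
print(neighbors, dists.round(3))
```

In this sketch the model for one incomplete sample would be trained on only `k` rows rather than on the full data set, which is where the reduction in computation would come from.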
Source
Mathematics in Practice and Theory (《数学的实践与认识》)
Peking University Core Journal (北大核心)
2017, No. 8, pp. 147-153 (7 pages)
Funding
Major Project of the Key Research Bases of Humanities and Social Sciences, Ministry of Education (15JJD910002)