Abstract
This paper uses the Zhongyuan Bank personal credit loan default dataset provided by the CCF competition. After data cleaning and feature engineering, the initial 38 features were reduced to 18. Drawing on the 5C theory and the expected income theory, the paper explores the main factors affecting banks' personal credit risk. Ranked by feature importance, the top five are: total revolving credit balance, number of days from the initial date to the loan disbursement date, the borrower's average loan score, the current loan interest rate, and the anonymous variable f0. To improve the accuracy of banks' personal credit risk assessment, three methods for handling imbalanced data (SMOTE, random undersampling, and SMOTEENN) were compared using a random forest model, and the combined SMOTEENN sampling performed best. Four machine learning models were then built: decision tree, random forest, AdaBoost, and LightGBM. The results show that, after balancing, LightGBM achieved the highest accuracy, reaching 96.1%.
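The three resampling strategies named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `random_undersample`, `smote_oversample`, and `edited_nearest_neighbours` are hypothetical, the neighbour count `k=3` is an assumed default, and the cleaning step is a simplified majority-vote variant of Edited Nearest Neighbours. Chaining the second and third functions approximates the idea behind SMOTEENN (oversample, then remove samples that disagree with their neighbours).

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature rows."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def random_undersample(X, y, seed=0):
    """Randomly keep, for every class, as many rows as the rarest class has."""
    rng = random.Random(seed)
    n_min = min(Counter(y).values())
    keep = []
    for label in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, n_min))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_oversample(X, y, minority, k=3, seed=0):
    """Add synthetic minority rows until the minority matches the largest class.

    Each synthetic row lies on the segment between an original minority row
    and one of its k nearest original minority neighbours.
    """
    rng = random.Random(seed)
    minority_rows = [x for x, lab in zip(X, y) if lab == minority]
    if len(minority_rows) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    n_needed = max(Counter(y).values()) - len(minority_rows)
    X_new, y_new = list(X), list(y)
    for _ in range(n_needed):
        base = rng.choice(minority_rows)
        others = [r for r in minority_rows if r is not base]
        neighbours = sorted(others, key=lambda r: sq_dist(base, r))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        X_new.append([a + gap * (b - a) for a, b in zip(base, nb)])
        y_new.append(minority)
    return X_new, y_new


def edited_nearest_neighbours(X, y, k=3):
    """Simplified ENN: drop rows whose label differs from the majority
    label among their k nearest neighbours."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        dists = sorted((sq_dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        votes = Counter(y[j] for _, j in dists[:k])
        if votes.most_common(1)[0][0] == yi:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]
```

For example, calling `smote_oversample(X, y, minority=1)` and passing the result to `edited_nearest_neighbours` yields a balanced, cleaned training set that could then be fed to any of the four classifiers the abstract compares. In practice a maintained library implementation would normally be preferred over a hand-written sketch like this one.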
Source
《建模与仿真》
2023, No. 4, pp. 3747-3755 (9 pages)
Modeling and Simulation