Abstract
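In many real-world credit scoring problems, labeling samples requires substantial manpower, money, and material resources, so often only a small number of labeled samples are available for training a classification model, while the large number of unlabeled customer samples in the database is discarded. To address this problem, this study introduces semi-supervised learning and combines it with the random subspace method (Random Subspace, RSS) from multiple classifier ensemble techniques, building RSSCI, an RSS-based semi-supervised co-training model for class-imbalanced settings. The model consists of three main stages: 1) train a number of base classifiers using the RSS method; 2) selectively label a portion of the most suitable samples from the large unlabeled dataset and add them to the original training set; 3) train a classification model on the final training set and classify the test samples. An empirical analysis on three customer credit scoring datasets shows that the credit scoring performance of RSSCI is better than that of commonly used supervised ensemble credit scoring models and also better than that of some existing semi-supervised co-training credit scoring models.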
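As a rough illustration of stage 1, the sketch below shows the core idea of the random subspace method: each base classifier sees only a randomly drawn subset of the features. The function names, the subspace size, the number of subspaces, and the toy data are all assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import random

def random_subspaces(n_features, n_subspaces, subspace_size, seed=0):
    """Draw one random feature-index subset per base classifier.

    All parameter values are hypothetical; the abstract does not state them.
    """
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subspace_size))
            for _ in range(n_subspaces)]

def project(rows, feature_idx):
    """Keep only the selected feature columns of each sample."""
    return [[row[j] for j in feature_idx] for row in rows]

# Toy data: 4 samples with 6 features each (values are illustrative only).
X = [[1, 2, 3, 4, 5, 6],
     [6, 5, 4, 3, 2, 1],
     [0, 1, 0, 1, 0, 1],
     [9, 8, 7, 6, 5, 4]]

subspaces = random_subspaces(n_features=6, n_subspaces=5, subspace_size=3)
# Each projected view would be used to train one base classifier.
views = [project(X, idx) for idx in subspaces]
```

Any base learner could then be fitted on each view; which learner the study uses is not stated in the abstract.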
Customer credit scoring is one of the most important issues in customer relationship management (CRM). In many real credit scoring problems, labeling samples costs a great deal of manpower, money, and material resources, so only a few labeled samples can be used to train classification models, and the many unlabeled customer samples are discarded. Furthermore, because current customer credit scoring problems are class-imbalanced, a single classification model struggles to classify the whole sample space accurately. To address these two problems, this study introduces semi-supervised learning, combines it with the random subspace (RSS) method from multiple classifier ensembles, and proposes RSSCI, an RSS-based semi-supervised co-training model for class imbalance. The model has three phases: 1) Train a number of base classifiers with RSS. 2) Label the most suitable samples in U, a set containing many samples without class labels. First, the three best-performing base classifiers are selected to classify the samples in U; samples for which the classifiers predict the same class are put into a candidate set, and the label confidence of each candidate is calculated. To account for the class imbalance of the training data, the candidate set is divided into positive and negative subsets, and the samples with higher confidence are selected from the two subsets according to the ratio of the two classes in the original training set and added to the original training set. 3) Train the classification model on the final training set and classify the test set. Empirical analysis is conducted on three credit scoring datasets (German, Australia, and UK-thomas; all three have imbalanced class distributions, and German and Australia come from the UCI public database). The results show that RSSCI outperforms commonly used supervised ensemble credit scoring models and some existing semi-supervised co-training credit scoring models, demonstrating the advantage of RSSCI's selective sample-labeling mechanism. CRM involves many other customer classification problems similar to credit scoring, such as customer churn prediction and customer targeting. The model proposed in this study can therefore also be applied to those problems and is expected to achieve satisfactory classification performance.
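The selective labeling step in phase 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the label confidence, so the mean predicted probability of the agreed class is used here as a stand-in. The 0.5 decision threshold, the `n_add` budget, the function name, and the toy probabilities are likewise hypothetical.

```python
def select_pseudo_labeled(probas, n_pos_train, n_neg_train, n_add):
    """Pick unlabeled samples to add to the training set.

    probas: one entry per unlabeled sample, holding the positive-class
            probability from each of the three selected base classifiers.
    Returns (sample_index, label) pairs, positives first.
    """
    pos, neg = [], []
    for i, p in enumerate(probas):
        votes = [1 if q >= 0.5 else 0 for q in p]  # assumed threshold
        if len(set(votes)) != 1:
            continue  # the classifiers disagree: not a candidate
        label = votes[0]
        mean_p = sum(p) / len(p)
        # Assumed confidence: mean probability of the agreed class.
        conf = mean_p if label == 1 else 1.0 - mean_p
        (pos if label == 1 else neg).append((conf, i))

    # Split the labeling budget by the class ratio of the original training set.
    pos_share = n_pos_train / (n_pos_train + n_neg_train)
    k_pos = round(n_add * pos_share)
    k_neg = n_add - k_pos

    # Take the most confident candidates from each subset.
    pos.sort(reverse=True)
    neg.sort(reverse=True)
    return ([(i, 1) for _, i in pos[:k_pos]] +
            [(i, 0) for _, i in neg[:k_neg]])

# Toy probabilities for 7 unlabeled samples (illustrative values only).
probas = [
    [0.90, 0.80, 0.95],  # all three predict positive
    [0.10, 0.20, 0.05],  # all three predict negative
    [0.60, 0.40, 0.70],  # disagreement, skipped
    [0.55, 0.60, 0.65],  # positive, low confidence
    [0.30, 0.45, 0.40],  # negative, low confidence
    [0.02, 0.01, 0.03],  # negative, high confidence
    [0.20, 0.30, 0.10],  # negative
]
chosen = select_pseudo_labeled(probas, n_pos_train=30, n_neg_train=70, n_add=3)
print(chosen)  # [(0, 1), (5, 0), (1, 0)]
```

With a 30:70 training ratio and a budget of three samples, one positive and two negative candidates are added, so the enlarged training set keeps roughly the original class proportions.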
Source
《中国管理科学》
CSSCI
PKU Core Journal (北大核心)
2016, No. 6, pp. 124-131 (8 pages)
Chinese Journal of Management Science
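Funding
National Natural Science Foundation of China (71471124, 71571126)
Sichuan Province Youth Fund (2015RZ0056)
Sichuan Province Social Science Planning Project (SC14C019)
Sichuan University Outstanding Young Scholars Fund (2013SCU04A08)
Sichuan University Young Academic Talents Fund in Philosophy and Social Sciences (skqx201607)
Innovation Team Program of the Sichuan Provincial Department of Education (13TD0040)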