Abstract
Imbalanced data are common in real-world applications such as credit assessment, financial fraud, and medical diagnosis. Among the many algorithms for handling imbalanced data, the synthetic minority over-sampling technique (SMOTE) is the most widely used. However, SMOTE weakens the true distribution of the data when it generates samples. Because Benford's Law can compensate for this weakening when applied to naturally occurring data, SMOTE is combined with Benford's Law to give a new algorithm for class-imbalanced data (the BL-SMOTE algorithm), intended to improve the authenticity and accuracy of the data distribution. Experimental results show that BL-SMOTE achieves better classification performance than SMOTE. In addition, random forest performs better than logistic regression, decision tree, and gradient boosting decision tree classifiers.
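For readers unfamiliar with the two ingredients named in the abstract, the following is a minimal sketch of each on its own: the textbook Benford first-digit probabilities, P(d) = log10(1 + 1/d), and the basic SMOTE interpolation step between a minority sample and one of its nearest minority neighbours. It does not reproduce BL-SMOTE, since the abstract does not say how the two are combined. The function names `benford_first_digit_probs` and `smote_sketch`, the parameters `k`, `n_new`, and `seed`, and the toy data are assumptions made for illustration only.

```python
import math
import random


def benford_first_digit_probs():
    """Expected first-digit frequencies under Benford's Law: P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by basic SMOTE-style interpolation.

    minority: list of equal-length numeric feature vectors (at least two).
    k, n_new, seed: hypothetical parameter names chosen for this sketch.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample.
        i = rng.randrange(len(minority))
        x = minority[i]
        # Find its k nearest minority neighbours by Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Interpolate a random fraction of the way towards one neighbour.
        z = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, z)])
    return synthetic


# Toy minority class, invented for illustration (not data from the paper).
toy_minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 4.0]]
toy_synthetic = smote_sketch(toy_minority, n_new=10, k=2)
benford_probs = benford_first_digit_probs()
```

Each synthetic point lies on the line segment between two existing minority samples, so the generated data reflect only the local geometry of the sample and no other distributional property, such as the first-digit frequencies that Benford's Law describes.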
Authors
张宸宁
李国成
ZHANG Chenning; LI Guocheng (School of Applied Sciences, Beijing Information Science & Technology University, Beijing 100192, China)
Source
《北京信息科技大学学报(自然科学版)》
2019, No. 2, pp. 23-28 (6 pages)
Journal of Beijing Information Science and Technology University
Funding
Supported by the National Natural Science Foundation of China (61473325)