Abstract
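This paper studies optimal parameter estimation and cost-sensitive classification in logistic regression when the two classes of the response variable are imbalanced. Under a cost-sensitive loss function, with the ratio of the two imbalanced class sizes taken as a parameter, the problem is equivalently converted into a re-weighted class-balanced classification problem, which yields an upper bound on the excess risk of prediction for the original problem and an upper bound on the error of the logistic regression coefficients. A minimax lower bound on the excess risk under regularity conditions is also obtained using VC-dimension techniques. The conclusion is that, up to a negligible constant factor, the error bound of the penalized likelihood estimator obtained from imbalanced data under the cost-sensitive loss is optimal, and the optimal error depends on a rare-class proportion that may converge to zero. The paper further extends the main conclusions to the case of a non-convex loss function and discusses the error upper bound when the ratio of the two class sizes has to be estimated. In addition, numerical simulations compare the practical performance with a given class proportion and with an estimated one, and find that the main conclusions are unaffected.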
This paper studies optimal parameter estimation and cost-sensitive risk for high-dimensional logistic regression with class-imbalanced data. Under a cost-sensitive loss framework, with the class-imbalance ratio included as a parameter, the problem is rebuilt as an equivalent re-weighted class-balanced problem. We obtain an upper bound on the excess risk of the prediction and on the estimation error rate of the logistic regression coefficients. A minimax lower bound for the excess risk is also obtained using a VC-dimension-based technique under suitable regularity conditions. Hence we conclude that penalized logistic regression with cost-sensitive loss attains the optimal rate for high-dimensional imbalanced data, and that the error rate depends on the rare-class ratio, which is allowed to converge to zero. The conclusion is then generalized to the case of a non-convex loss function. An upper bound on the error rate when the two-class ratio is plug-in estimated is also obtained. Numerical simulations show that the conclusion is not affected by the plug-in process.
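The re-weighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' estimator: the weights `0.5 / pi` and `0.5 / (1 - pi)`, the L1 penalty, and the function names `class_weights` and `weighted_logistic_loss` are assumptions chosen only to show how a class-imbalanced logistic loss can be rewritten as a balanced weighted one, with the rare-class proportion `pi` either given or plug-in estimated from the labels.

```python
import numpy as np

def class_weights(y, pi=None):
    """Per-sample weights that give both classes equal total weight.

    pi is the rare-class (y == 1) proportion; if None, it is replaced by
    the plug-in estimate mean(y). The specific weights are an assumption.
    """
    y = np.asarray(y)
    if pi is None:
        pi = y.mean()  # plug-in estimate of the class proportion
    return np.where(y == 1, 0.5 / pi, 0.5 / (1.0 - pi))

def weighted_logistic_loss(beta, X, y, pi=None, lam=0.0):
    """Weighted logistic negative log-likelihood plus an L1 penalty."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    w = class_weights(y, pi)
    z = X @ beta
    # log(1 + exp(z)) - y * z, computed stably with logaddexp
    nll = np.logaddexp(0.0, z) - y * z
    return np.mean(w * nll) + lam * np.sum(np.abs(beta))
```

With one positive label among four samples, the single rare-class sample receives the same total weight as the three majority-class samples combined, so the weighted problem behaves like a balanced one.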
Authors
李智凡
尹建鑫
LI Zhifan; YIN Jianxin (Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing 100872)
Source
《系统科学与数学》
CSCD
Peking University Core Journal (北大核心)
2023, No. 9, pp. 2341-2363 (23 pages)
Journal of Systems Science and Mathematical Sciences
Funding
Supported by the National Key Research and Development Program of China (2020YFC2004900).