Abstract
Data imbalance biases a model toward assigning test samples to the majority class, so the minority class is classified poorly. The problem can be addressed at the data level or at the algorithm level. This study focuses on handling imbalance during the selection of key factors (variable selection). At the data level, a Group Lasso logistic model based on SMOTE sampling is used. At the algorithm level, Group Lasso with threshold adjustment is used, in two variants: tuning the parameters step by step and tuning them simultaneously. Finally, the three methods are used to build a diagnostic model of "liver depression and spleen deficiency" on questionnaire data from 307 sub-health patients. The results show that the SMOTE-based method and the simultaneous-tuning method yield models with better predictive performance in terms of sensitivity and specificity.
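The threshold-adjusting idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the threshold was chosen, so the criterion used here (maximizing sensitivity + specificity - 1, often called Youden's index), the function names, and the toy scores are all assumptions made for illustration. The Group Lasso fit itself is left out; the scores stand in for predicted probabilities from any fitted classifier.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when score >= threshold predicts class 1.

    Assumes both classes are present in `labels` (otherwise a division by zero occurs).
    """
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def best_threshold(labels, scores):
    """Pick the observed score that maximizes sensitivity + specificity - 1.

    This selection rule is an assumption, not the procedure from the paper.
    """
    candidates = sorted(set(scores))
    return max(
        candidates,
        key=lambda t: sum(sensitivity_specificity(labels, scores, t)) - 1,
    )


# Toy, made-up data: 2 minority cases (label 1) among 8 samples.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.45, 0.30, 0.40]

# With the conventional 0.5 cut-off every sample is predicted as majority class.
default_result = sensitivity_specificity(labels, scores, 0.5)

# Lowering the cut-off recovers the minority cases at a small cost in specificity.
tuned = best_threshold(labels, scores)
tuned_result = sensitivity_specificity(labels, scores, tuned)
```

On this toy data the default cut-off misses every minority case, while the tuned cut-off trades a little specificity for full sensitivity, which is the kind of trade-off the abstract evaluates.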
Authors
Jia Pingping (贾萍萍)
Li Yang (李扬)
Jia Pingping; Li Yang (Center for Applied Statistics, Renmin University of China, Beijing 100872, China; School of Statistics, Renmin University of China, Beijing 100872, China)
Source
《世界科学技术-中医药现代化》
CSCD
Peking University Core Journals (北大核心)
2019, Issue 3, pp. 389-394 (6 pages)
Modernization of Traditional Chinese Medicine and Materia Medica-World Science and Technology
Funding
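Major Project of the Key Research Bases of Humanities and Social Sciences, Ministry of Education of China (16JJD910002): Big-data-based biostatistical analysis methods for precision medicine and their applications. Principal investigator: Xu Wangli
National Natural Science Foundation of China, Young Scientists Fund (11401013): Joint statistical modeling based on functional data analysis: theory and applications. Principal investigator: Huang Hui
Renmin University of China, 2017 Special Guiding Fund for Building World-Class Universities (Disciplines) and Characteristic Development of Central Universities. Principal investigator: Zhao Yanyun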