Abstract
Experimental data on the material basis of traditional Chinese medicine (TCM) are typically high-dimensional with few samples, and they contain many irrelevant and redundant features, which makes it difficult to mine information about TCM substances in depth. This paper proposes a hybrid feature selection method based on the maximal information coefficient (MIC) and iterative XGBoost. The method first uses MIC to measure the correlation between each feature and the target variable, filters out irrelevant features according to an evaluation criterion, and obtains a candidate feature subset. The candidate subset is then sorted and partitioned, and XGBoost is applied in turn to iteratively remove redundant features, yielding the effective feature subset. Experimental results show that the method selects a small number of highly interpretable features and adapts well to experimental data on the material basis of TCM.
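The first (filter) stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `approx_mic` is a simplified stand-in that takes the largest normalized mutual information over fixed equal-frequency grids, whereas true MIC also optimizes the grid boundaries. The names `approx_mic` and `filter_features`, the `threshold` parameter, and the synthetic data are all hypothetical; the abstract does not state the evaluation criterion. The second (XGBoost) stage is not covered here.

```python
import math
import random
from collections import Counter


def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins rank-based (equal-frequency) bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def mutual_information(bx, by):
    """Mutual information (in nats) between two discretized sequences."""
    n = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def approx_mic(x, y, max_cells=None):
    """Simplified MIC-style score in [0, 1] (assumption: fixed grids only).

    Searches all grids with nx * ny <= max_cells, where max_cells defaults
    to n ** 0.6, and normalizes each mutual information by log(min(nx, ny)).
    """
    n = len(x)
    budget = max_cells or max(4, int(n ** 0.6))
    best = 0.0
    for nx in range(2, budget // 2 + 1):
        bx = equal_frequency_bins(x, nx)
        for ny in range(2, budget // nx + 1):
            by = equal_frequency_bins(y, ny)
            score = mutual_information(bx, by) / math.log(min(nx, ny))
            best = max(best, score)
    return min(1.0, best)


def filter_features(columns, target, threshold=0.5):
    """Keep features whose score reaches the (hypothetical) threshold.

    Returns (name, score) pairs sorted from most to least relevant, which is
    the ordering a later redundancy-removal stage could consume.
    """
    scored = [(name, approx_mic(col, target)) for name, col in columns.items()]
    kept = [(name, s) for name, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Synthetic illustration: one informative feature and one unrelated feature.
rng = random.Random(0)
x_lin = [rng.uniform(-1, 1) for _ in range(60)]
noise = [rng.uniform(-1, 1) for _ in range(60)]
target = [2 * v + 1 for v in x_lin]

candidates = filter_features({"x_lin": x_lin, "noise": noise}, target, threshold=0.9)
print(candidates)
```

In practice a dedicated MIC implementation would replace `approx_mic`, and the sorted candidates would then be partitioned and passed to the XGBoost-based stage to remove redundant features.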
Authors
Xiong Lingzhu (熊玲珠)
Qiu Weihan (邱伟涵)
Luo Jigen (罗计根)
Li Keding (李科定)
Affiliations: College of Computer Science, Jiangxi University of Chinese Medicine, Nanchang 330004, Jiangxi, China; South China Normal University, Guangzhou 510631, Guangdong, China; Xiamen Xian Yue Hospital, Xiamen 361012, Fujian, China
Source
Computer Applications and Software (《计算机应用与软件》)
Peking University Core Journal list (北大核心)
2023, No. 1, pp. 280-286, 305 (8 pages in total)
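Funding
National Natural Science Foundation of China (61363042, 61562045, 61762051)
Key Project of the Jiangxi Provincial Key Research and Development Program (20171ACE50021)
Science and Technology Research Project of the Jiangxi Provincial Department of Science and Technology (GJJ190683)
Jiangxi Provincial Graduate Innovation Special Fund Project (YC2018-S281)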