Wavelength Selection Method of Near-Infrared Spectrum Based on Random Forest Feature Importance and Interval Partial Least Square Method

Cited by: 2
Abstract: When building rapid quantitative analysis models for near-infrared (NIR) spectroscopy, feature wavelength selection is one of the more effective ways to improve prediction accuracy: it retains informative wavelengths, reduces data redundancy, and improves the usefulness of the data. Random forest (RF), an ensemble algorithm, can select features by computing feature importance. RF applies the mean decrease accuracy (MDA) method to out-of-bag (OOB) data and takes the averaged mean square error as the feature importance; variables that pass a feature importance threshold form the feature wavelength subset. There is no theoretical basis for setting the range of this threshold, so the range needs to be explored. In addition, because RF is random, the feature wavelength subset may contain invalid or even interfering variables, so the validity of the selected variables cannot be guaranteed. The RF-iPLS wavelength selection method is therefore proposed. Interval partial least squares (iPLS) tends to select contiguous wavebands; using it to divide the RF feature wavelength subset into intervals compensates for the invalid variables introduced by RF's randomness, while the discrete wavelengths selected by RF address the redundant information contained in the contiguous bands selected by iPLS. To examine the rationality of RF-iPLS, an RF-MC-iPLS algorithm was also constructed, in which the feature subset undergoes 500 rounds of Monte Carlo (MC) sampling. Although the two algorithms are similar in structure, the running time of RF-iPLS is 11.12% shorter than that of RF-MC-iPLS, which indicates that RF-iPLS selects feature wavelengths effectively for the prediction model and has low time complexity. To further verify the effectiveness of the improved RF-iPLS algorithm, it was applied to a public grain protein NIR spectroscopy data set; PLSR models were built and compared with the full-spectrum PLSR model and with PLSR models based on other wavelength selection methods. Out of the 117 wavelengths of the full spectrum, RF-iPLS selected 12 feature wavelengths. The RMSEC of the calibration set fell from 2.61 to 0.64, an improvement of about 75.5%, and the RMSEP of the prediction set fell from 2.63 to 0.69, an improvement of 73.8%, the best result among the compared methods. These results show that RF-iPLS is an effective feature wavelength selection method that can reduce the complexity of NIR quantitative analysis models and achieve efficient dimensionality reduction.
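The error reductions quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `relative_reduction` is hypothetical and not from the paper, and it assumes the reported percentages are relative reductions in RMSE.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the percentage drop from `before` to `after`."""
    return (before - after) / before * 100.0


# RMSE values as reported in the abstract (full spectrum vs. RF-iPLS).
rmsec_drop = relative_reduction(2.61, 0.64)
rmsep_drop = relative_reduction(2.63, 0.69)
# Share of the 117 full-spectrum wavelengths that were discarded.
wavelength_drop = relative_reduction(117, 12)

print(f"RMSEC reduction: {rmsec_drop:.1f}%")        # 75.5%
print(f"RMSEP reduction: {rmsep_drop:.1f}%")        # 73.8%
print(f"Wavelengths removed: {wavelength_drop:.1f}%")  # 89.7%
```

Both figures agree with the 75.5% and 73.8% stated in the abstract, which suggests they were computed as relative RMSE reductions.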
Authors: CHEN Rui; WANG Xue; WANG Zi-wen; QU Hao; MA Tie-min; CHEN Zheng-guang; GAO Rui (College of Information and Electrical Engineering, Heilongjiang Bayi Agricultural University, Daqing 163319, China; Daqing Center of Inspection and Testing for Agricultural Products and Processed Products, Ministry of Agriculture and Rural Affairs, Daqing 163319, China; School of Electrical and Information, Northeast Agricultural University, Harbin 150030, China)
Source: Spectroscopy and Spectral Analysis (《光谱学与光谱分析》); indexed in SCIE, EI, CAS, CSCD, Peking University Core Journals; 2023, Issue 4, pp. 1043-1050 (8 pages)
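Funding: Supported by the Heilongjiang Province "Hundred, Thousand, Ten Thousand" Engineering Science and Technology Major Project (2019ZX14A0401); the Central Government Fund for Supporting the Reform and Development of Local Universities (2020GSP15); the Heilongjiang Postdoctoral General Program (LBH-Z19217); the Heilongjiang Bayi Agricultural University "Three Horizontal and Three Vertical" Support Program (ZRCQC201907); and the Heilongjiang Bayi Agricultural University Research Start-up Fund for Returning Talents (XDB202004).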
Keywords: Wavelength selection; Feature importance calculation; Grain protein content; Quantitative analysis
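The feature importance calculation described in the abstract (mean decrease accuracy followed by a threshold) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: a fixed stand-in predictor replaces the random forest, importance is measured on the same toy data rather than on out-of-bag samples, and the names `permutation_importance`, `select_by_threshold`, and `toy_predict`, as well as the threshold value, are hypothetical.

```python
import random
from typing import Callable, List, Sequence

Row = List[float]


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean square error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(
    predict: Callable[[List[Row]], List[float]],
    X: List[Row],
    y: List[float],
    n_repeats: int = 20,
    seed: int = 0,
) -> List[float]:
    """Average MSE increase when each column is shuffled (MDA-style score)."""
    rng = random.Random(seed)
    baseline = mse(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        increase = 0.0
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other column untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increase += mse(y, predict(X_perm)) - baseline
        importances.append(increase / n_repeats)
    return importances


def select_by_threshold(importances: Sequence[float], threshold: float) -> List[int]:
    """Keep the indices whose importance exceeds the threshold."""
    return [j for j, score in enumerate(importances) if score > threshold]


# Toy data: 4 "wavelengths"; only columns 0 and 2 drive the response.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(4)] for _ in range(50)]
y = [3.0 * row[0] + 1.5 * row[2] for row in X]


def toy_predict(rows: List[Row]) -> List[float]:
    # Stand-in for a fitted model (assumption: not a random forest).
    return [3.0 * r[0] + 1.5 * r[2] for r in rows]


scores = permutation_importance(toy_predict, X, y)
selected = select_by_threshold(scores, threshold=0.01)  # threshold is arbitrary
print(selected)
```

Columns 1 and 3 score exactly zero here because the stand-in predictor ignores them. With a real random forest the scores of uninformative wavelengths fluctuate around zero, which is why the abstract notes that the choice of threshold range has to be explored rather than derived.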