Abstract
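Drug quality bears on public health and the vitality of the nation, and with rapid socioeconomic development the fast, effective identification of drug quality has become extremely important. Spectral analysis offers high accuracy, fast analysis and no contamination of the sample, and is widely used in important fields such as the chemical industry, petroleum and medicine. To address the low accuracy, insufficient speed and poor stability of traditional drug identification models, a spectrometer is used to collect near-infrared spectra of drugs so that they can be identified without contamination. Random forest and CatBoost are combined to classify drugs quickly and accurately. First, random forest (RF) screens the spectral data collected by the spectrometer for effective characteristic wavelengths, removing irrelevant wavelengths and keeping those that best represent the sample properties. Then an extreme learning machine (ELM) serves as the weak classifier of CatBoost to identify drug attributes from the selected wavelengths. Because ELM has only one hidden layer and needs no iterative optimization, the identification model runs faster, while CatBoost improves identification accuracy by ensembling the weak classifiers. To evaluate the proposed model, drug spectral datasets of different sizes are constructed by randomly drawing training sets, independent experiments are run on each, and the mean of 10 runs is taken as the final result. The model is further evaluated by comparison with CatBoost, support vector machine (SVM), back-propagation network (BP), ELM, summation wavelet extreme learning machine (SWELM) and Boosting. The classification results across training-set sizes show that, as the number of training samples increases, classification accuracy reaches as high as 100% and the standard deviation of the predictions tends to 0. The experiments show that the RF-CatBoost identification model achieves higher classification accuracy, faster speed and stronger robustness than the comparison methods on drug datasets of different sizes, and can be widely applied to the accurate identification of drug categories, supporting effective supervision of drug quality.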
Drug quality is related to people’s health and the national lifeblood. With the rapid development of the economy and society, the rapid and effective identification of drug quality plays an extremely important role. Spectral analysis technology offers high accuracy, fast analysis and no contamination of samples, and is widely used in the chemical industry, petroleum, medicine and other important areas. In order to solve the problems of low accuracy, low identification speed and poor stability of traditional drug identification models, a spectrometer was used to collect near-infrared spectroscopy data of drugs so that the drugs could be identified without contamination. Random forest and CatBoost were then combined to classify and identify drugs quickly and accurately. The proposed method first uses Random Forest (RF) to screen the effective characteristic wavelengths of the spectral data collected by the spectrometer, eliminating irrelevant wavelengths and retaining the characteristic wavelengths that best characterize the sample properties. An Extreme Learning Machine (ELM) was then used as the weak classifier of CatBoost to identify drug attributes from the selected characteristic wavelengths. Since ELM contains only one hidden layer and requires no iterative optimization, the identification model runs faster, while CatBoost improves identification accuracy by integrating the weak classifiers. In order to evaluate the performance of the proposed drug identification model, drug spectral data sets of different sizes were constructed by randomly selecting training sets, and independent experiments were carried out on each. The mean value of 10 runs was taken as the final result. In addition, the model was compared with CatBoost, Support Vector Machine (SVM), Back Propagation network (BP), ELM, Summation Wavelet Extreme Learning Machine (SWELM) and Boosting to evaluate its performance further. As can be seen from the classification results for training sets of different sizes, as the number of training samples increases, the classification accuracy reaches as high as 100% and the prediction standard deviation tends to 0. The experimental results show that the proposed RF-CatBoost identification model has higher classification accuracy, faster speed and stronger robustness than the comparison methods on drug data sets of different sizes, and can be widely used for the accurate identification of drug categories, so as to achieve effective supervision of drug quality.
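The abstract's two mechanical claims (ELM output weights come from a single solve rather than iterative training, and results are reported as the mean over 10 random training draws) can be illustrated with a rough sketch. This is not the authors' implementation. It does not use the CatBoost library; a generic SAMME-style boosting loop stands in for the ensemble step. The random-forest wavelength screening is not shown, so the input is assumed to be already reduced to the selected wavelengths. The names `ELM`, `boost_elms`, `boost_predict`, `make_toy_spectra` and `repeated_accuracy` are hypothetical, and the data are synthetic clusters, not spectra.

```python
import numpy as np


class ELM:
    """Illustrative single-hidden-layer extreme learning machine."""

    def __init__(self, n_hidden=20, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y, n_classes, sample_weight=None):
        n, d = X.shape
        if sample_weight is None:
            sample_weight = np.full(n, 1.0 / n)
        # Input weights and biases are drawn at random and never trained.
        self.W = self.rng.normal(size=(d, self.n_hidden)) / np.sqrt(d)
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        T = np.eye(n_classes)[y]  # one-hot targets
        s = np.sqrt(sample_weight)[:, None]
        # Output weights from one weighted least-squares solve (no iterations).
        self.beta = np.linalg.pinv(s * H) @ (s * T)
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)


def boost_elms(X, y, n_classes, n_rounds=10, n_hidden=20, seed=0):
    """SAMME-style boosting with ELMs as weak learners (stand-in for CatBoost)."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)
    models, alphas = [], []
    for t in range(n_rounds):
        elm = ELM(n_hidden, seed + t).fit(X, y, n_classes, w)
        miss = elm.predict(X) != y
        err = np.clip(w[miss].sum(), 1e-10, 1 - 1e-10)
        alpha = np.log((1 - err) / err) + np.log(n_classes - 1)
        if alpha <= 0:  # no better than chance: skip this learner
            continue
        models.append(elm)
        alphas.append(alpha)
        w = w * np.exp(alpha * miss)  # up-weight misclassified samples
        w /= w.sum()
    return models, alphas


def boost_predict(models, alphas, X, n_classes):
    """Weighted vote over the boosted ELMs."""
    votes = np.zeros((X.shape[0], n_classes))
    rows = np.arange(X.shape[0])
    for elm, a in zip(models, alphas):
        votes[rows, elm.predict(X)] += a
    return np.argmax(votes, axis=1)


def make_toy_spectra(n_per_class=40, n_features=8, n_classes=3, seed=0):
    """Synthetic, well-separated clusters standing in for selected wavelengths."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(n_classes, n_features))
    X = np.vstack([c + rng.normal(scale=0.1, size=(n_per_class, n_features))
                   for c in centers])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y


def repeated_accuracy(X, y, n_classes, train_size, n_runs=10, seed=0):
    """Mean and standard deviation of test accuracy over random training draws."""
    rng = np.random.default_rng(seed)
    accs = []
    for run in range(n_runs):
        idx = rng.permutation(len(y))
        tr, te = idx[:train_size], idx[train_size:]
        models, alphas = boost_elms(X[tr], y[tr], n_classes, seed=run)
        pred = boost_predict(models, alphas, X[te], n_classes)
        accs.append(np.mean(pred == y[te]))
    return float(np.mean(accs)), float(np.std(accs))
```

Calling `X, y = make_toy_spectra()` followed by `repeated_accuracy(X, y, n_classes=3, train_size=60)` returns a (mean, standard deviation) pair. Varying `train_size` mimics the abstract's comparison across training-set sizes, though on toy data only.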
Authors
蒋萍
路皓翔
刘振丙
JIANG Ping; LU Hao-xiang; LIU Zhen-bing (School of Computer and Information Technology, Guangxi Police College, Nanning 530028, China; College of Computer and Information Security, Guilin University of Electronic Technology, Guilin 541004, China)
Source
《光谱学与光谱分析》
SCIE
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2022, No. 7, pp. 2148-2155 (8 pages)
Spectroscopy and Spectral Analysis
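Funding
National Natural Science Foundation of China (61866009)
Guangxi Key Research and Development Program (桂科AB22035034)
Guangxi Police College institutional research project (2021KYA01)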