Abstract
Objective: To build a more stable, reliable, and practical deep learning model for predicting the probability that a lead compound becomes a drug. Methods: Positive and negative samples were collected from three databases: Integrity, ChEMBL, and DrugBank. After the resulting large data set was cleaned and its class imbalance addressed, the compounds' Simplified Molecular Input Line Entry System (SMILES) strings were converted to a standardized encoding. On this basis, a deep neural network combining a Stacked AutoEncoder (SAE) and a Fully Connected Neural Network (FCNN) was constructed and trained to extract compound features and predict the probability that a lead compound becomes a drug. Results: The model converged stably. On the validation set, accuracy (ACC) and area under the curve (AUC) reached 0.9953 and 0.9927, respectively (the English version of the abstract gives the AUC as 0.9929), an improvement in prediction accuracy of about 3% over previously reported machine learning models. Conclusion: A prediction model built on a large data set with deep neural network techniques has a degree of practical value and can improve the accuracy of predicting whether a lead compound becomes a drug.
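The abstract does not specify the SMILES encoding or how the metrics were computed, so the following is only a minimal sketch under stated assumptions: a character-level one-hot encoding with zero padding (one common choice, not necessarily the authors'), plus plain-Python definitions of ACC and AUC. The function names, vocabulary, `max_len`, and toy inputs are all hypothetical.

```python
# Sketch only: the vocabulary, padding length, and toy data are assumptions,
# not the encoding or data used in the paper.

def build_vocab(smiles_list):
    """Map each character seen in the SMILES strings to an index.
    Index 0 is reserved for padding and is never set."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(smiles, vocab, max_len):
    """Flattened one-hot vector, truncated or zero-padded to max_len characters."""
    width = len(vocab) + 1
    vec = [0] * (max_len * width)
    for pos, ch in enumerate(smiles[:max_len]):
        vec[pos * width + vocab[ch]] = 1
    return vec

def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

smiles = ["CCO", "c1ccccc1", "CC(=O)O"]  # toy molecules, not from the paper
vocab = build_vocab(smiles)
x = encode("CCO", vocab, max_len=8)
```

A fixed-length vector like `x` is the kind of input an autoencoder or fully connected network expects; the pairwise AUC above is quadratic in the number of samples and is meant for illustration rather than for large data sets.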
Authors
PAN Lei (潘蕾)
NI Bing-wei (倪冰苇)
ZHAO Hong-ping (赵鸿萍)
School of Science, China Pharmaceutical University, Nanjing 211198, China
Source
Chinese Journal of New Drugs (《中国新药杂志》)
CAS
CSCD
PKU Core (北大核心)
2021, Issue 14, pp. 1309-1315 (7 pages)
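Funding
National Natural Science Foundation of China, General Program (81973512)
China Pharmaceutical University Teaching Reform Research Project, Key Project (3050050188)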
Keywords
Stacked AutoEncoder
Fully Connected Neural Network
deep learning
SMILES
probability prediction of a lead compound becoming a drug