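Abstract: Two problems motivate this work: class imbalance in clinical medical scale data tends to degrade model performance, and deep learning frameworks struggle to match traditional machine learning methods on scale-data tasks. To address both, a Transformer network model based on cascaded undersampling (layer by layer Transformer, LLT) is proposed. LLT removes majority-class samples layer by layer through cascaded undersampling, which balances the classes and reduces the effect of class imbalance on the classifier. It also uses an attention mechanism to assess the relevance of the input features and perform feature selection, refining feature extraction and improving model performance. Rheumatoid arthritis (RA) data were used as test samples. Experiments show that, without changing the sample distribution, the proposed cascaded undersampling method raises the recognition rate for the minority class by 6.1%, which is 1.4% and 10.4% higher than the commonly used NEARMISS and ADASYN methods, respectively. On the RA scale data, LLT reaches an accuracy of 72.6% and an F1-score of 71.5%, with an AUC of 0.89 and an mAP of 0.79, outperforming current mainstream scale-data classification models such as RF, XGBoost, and GBDT. Finally, the model's internal process is visualized and the features that influence RA are analyzed, which offers useful guidance for the clinical diagnosis of RA.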
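The layer-by-layer reduction of the majority class can be illustrated with a minimal Python sketch. The abstract does not say how LLT chooses which majority samples to drop at each layer, so this sketch substitutes plain random removal on a geometric schedule; the function name `cascade_undersample` and its parameters are hypothetical and are not taken from the paper.

```python
import random


def cascade_undersample(majority, n_minority, n_layers=3, seed=0):
    """Shrink `majority` toward `n_minority` samples over `n_layers` stages.

    Assumption: samples are dropped at random in each stage. The LLT
    paper's actual selection rule is not given in the abstract.
    Returns the kept samples and the majority size after each stage.
    """
    if n_layers < 1:
        raise ValueError("n_layers must be at least 1")
    rng = random.Random(seed)
    kept = list(majority)
    n_start = len(kept)
    sizes = [n_start]
    if n_start <= n_minority:
        return kept, sizes  # already balanced, nothing to remove

    # Per-stage keep ratio chosen so that n_layers stages reach n_minority.
    ratio = (n_minority / n_start) ** (1.0 / n_layers)
    for layer in range(1, n_layers + 1):
        if layer == n_layers:
            target = n_minority  # final stage lands exactly on the minority count
        else:
            target = max(n_minority, round(n_start * ratio ** layer))
        kept = rng.sample(kept, target)  # random subset of the current survivors
        sizes.append(len(kept))
    return kept, sizes


if __name__ == "__main__":
    # Hypothetical data: 800 majority samples against 100 minority samples.
    kept, sizes = cascade_undersample(range(800), n_minority=100, n_layers=3)
    print(sizes)
```

In a cascade like this, a classifier would typically be trained on each intermediate subset; that step is omitted here because the abstract gives no detail on it.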
Funding: Supported by the National Natural Science Foundation of China (20975011) and the "Chemical Grid Project" of Beijing University of Chemical Technology.
Abstract: The human ether-a-go-go related gene (hERG) channel is responsible for repolarization during the action potential, and its blockage may result in severe cardiotoxicity and sudden death. In this study, a dataset of 1969 compounds was compiled from the literature and from FDA-approved drugs. Using a support vector machine (SVM), two groups of computational models were built to classify a compound as a blocker or non-blocker of the hERG potassium ion channel. The models performed satisfactorily overall. The 100 models built with MACCS fingerprints (Model Group A) showed an average accuracy of 90% and an average Matthews correlation coefficient (MCC) of 0.77 on the test sets. The 100 models built with selected MOE descriptors (Model Group B) showed an average accuracy of 89% and an average MCC of 0.74 on the test sets. Molecular hydrophobicity and lipophilicity were found to be very important factors contributing to blockage of the hERG potassium ion channel. Other molecular properties, such as electrostatic properties, features based on van der Waals surface area, the number of rigid bonds, and molecular surface rugosity, also played important roles in blocking the hERG potassium ion channel.
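For readers unfamiliar with the MCC reported above, the following Python sketch computes it from the four cells of a binary confusion matrix using the standard definition. The counts in the example are made up for illustration and are not the study's data; the function name `mcc` is likewise only illustrative.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    tp/tn/fp/fn are the counts of true positives, true negatives,
    false positives, and false negatives. Returns 0.0 when any marginal
    total is zero, a common convention for the undefined case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


if __name__ == "__main__":
    # Hypothetical test set: 50 blockers and 50 non-blockers.
    counts = dict(tp=45, tn=45, fp=5, fn=5)
    print(accuracy(**counts))  # 0.9
    print(mcc(**counts))       # 0.8
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when blockers and non-blockers are unevenly represented.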