Abstract
Drug development is capital-intensive, high-risk, and slow, and it demands substantial funding, manpower, and material resources. Traditional machine learning methods can assist drug development to some extent, but they require molecular descriptors as input features, and the choice of descriptors strongly affects model performance. Most traditional machine learning methods therefore depend on complex, time-consuming feature engineering. Emerging deep learning methods can instead learn features directly from the raw structure of a drug, which bypasses feature engineering and shortens the development cycle. This paper divides existing drug representation learning methods into two classes: methods based on simplified molecular input line entry specification (SMILES) expressions and methods based on molecular graphs. It reports recent research progress for both classes and describes the innovations and limitations of the individual methods. Finally, it identifies major challenges in current drug representation learning research and discusses possible solutions.
Authors
CHEN Xin (陈鑫)
LIU Xien (刘喜恩)
WU Ji (吴及)
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
Source
Journal of Tsinghua University (Science and Technology) (《清华大学学报(自然科学版)》)
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2020, No. 2, pp. 171-180 (10 pages)
Funding
National Key Research and Development Program of China (2018YFC0116800).
Keywords
drug
representation learning
simplified molecular input line entry specification (SMILES)
molecular graph