Abstract
Traditional methods for predicting cell-penetrating peptides rely heavily on cumbersome feature extraction and feature reconstruction steps; the algorithms are complex and their accuracy is limited. To address these problems, this study proposes a method that combines character embedding, a technique from natural language processing, with a combined CNN-LSTM machine learning framework to predict cell-penetrating peptides. First, the character representing each amino acid is mapped, through network learning on the training dataset, into a compact vector space, so that each amino acid character corresponds to one compact embedding vector. Second, peptide sequences are transformed into numerical matrices using the trained embedding vectors. Finally, these matrices are fed to the CNN-LSTM model, which extracts features on its own and predicts whether the input sequence is cell-penetrating. In experiments on the same datasets, the proposed method achieved an AUC (area under the ROC curve) of 0.97 and a correct index of 0.846 on the test set, outperforming the other methods compared, which indicates that it predicts cell-penetrating peptides simply and efficiently.
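The encoding step described in the abstract, in which each amino acid character is looked up in an embedding table and the peptide becomes a numerical matrix, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and not taken from the paper: the embedding dimension of 8, the maximum length of 30, right-padding with a zero vector, and the random table that stands in for embeddings that the actual method learns during training. The function names are hypothetical. The example peptide is penetratin, a well-known cell-penetrating peptide.

```python
import random

# The 20 standard amino acids in one-letter code; index 0 is reserved for padding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CHAR_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def make_embedding_table(dim, seed=0):
    """Random stand-in for a learned embedding table (assumption: the real
    vectors would come from training, not from a random generator)."""
    rng = random.Random(seed)
    table = [[0.0] * dim]  # row 0: padding vector
    for _ in AMINO_ACIDS:
        table.append([rng.uniform(-0.05, 0.05) for _ in range(dim)])
    return table


def encode_peptide(sequence, table, max_len):
    """Turn a peptide string into a max_len x dim matrix by embedding lookup.

    Sequences longer than max_len are truncated; shorter ones are right-padded
    with the padding vector. Non-standard residue letters raise KeyError.
    """
    indices = [CHAR_TO_INDEX[aa] for aa in sequence.upper()[:max_len]]
    indices += [0] * (max_len - len(indices))
    return [list(table[i]) for i in indices]


# Hypothetical sizes chosen only for this illustration.
table = make_embedding_table(dim=8)
matrix = encode_peptide("RQIKIWFQNRRMKWKK", table, max_len=30)
print(len(matrix), len(matrix[0]))
```

A matrix of this shape is the kind of input a CNN-LSTM model would consume; in a deep learning framework the lookup table would normally be a trainable embedding layer updated jointly with the rest of the network.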
Authors
方春
孙福振
李彩虹
邢林林
FANG Chun; SUN Fu-zhen; LI Cai-hong; XING Lin-lin (School of Computer Science and Technology, Shandong University of Technology, Zibo, Shandong 255049, China)
Source
《计算机仿真》
Peking University Core Journal
2019, Issue 10, pp. 353-358 (6 pages)
Computer Simulation
Funding
National Natural Science Foundation of China (61602280, 61473179)
Natural Science Foundation of Shandong Province (ZR2014FQ028)
Keywords
Deep learning
Character embedding
Cell-penetrating peptide
Prediction