Abstract
DNA/single-wall carbon nanotube (SWCNT) hybrids have enabled many applications because of their special ability to disperse and sort SWCNTs by their chirality and handedness. Much work has been done to discover sequences that recognize specific chiralities of SWCNT, and significant progress has been made in understanding the underlying structure and thermodynamics of these hybrids. Nevertheless, de novo prediction of recognition sequences remains essentially impossible, and the success rate for their discovery by search of the vast single-stranded DNA library is very low. Here, we report an effective way of predicting recognition sequences based on machine learning analysis of existing experimental sequence data sets. Multiple input feature construction methods (position-specific, term-frequency, combined or segmented term-frequency vector, and motif-based feature) were used and compared. The transformed features were used to train several classifier algorithms (logistic regression, support vector machine, and artificial neural network). Trained models were used to predict new sets of recognition sequences, and consensus among a number of models was used successfully to counteract the limited size of the data set. Predictions were tested using aqueous two-phase separation. New data thus acquired were used to retrain the models by adding an experimentally tested new set of predicted sequences to the original set. The frequency of finding correct recognition sequences by the trained model increased to >50% from the ~10% success rate in the original training data set.
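As a rough illustration of two of the feature constructions named above, and of consensus among models, the following minimal sketch encodes a DNA sequence as a position-specific (one-hot) vector and as a term-frequency (k-mer count) vector. The function names, the choice of k = 2, and the simple vote threshold are assumptions made for illustration; they are not taken from the paper and may differ from the authors' implementation.

```python
from itertools import product

BASES = "ACGT"  # assumed alphabet order; any fixed order works


def position_specific(seq):
    """One-hot encode each position: four entries per base, in BASES order."""
    vec = []
    for base in seq:
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec


def term_frequency(seq, k=2):
    """Count every possible k-mer over overlapping windows of the sequence."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return [counts[m] for m in kmers]


def consensus(votes, min_agree):
    """Accept a candidate only if at least min_agree models vote positive (1)."""
    return sum(votes) >= min_agree
```

In a pipeline of this kind, vectors like these would be passed to each classifier, and a candidate sequence would be kept only when enough of the trained models agree on it.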
Funding
Y.Y. was supported by a Dean’s Fellowship.