Protein-protein interactions play a crucial role in cellular processes such as metabolic pathways and immunological recognition. This paper presents a new domain score-based support vector machine (SVM) to infer protein interactions. The method not only explores all possible domain interactions through the kernel method, but also reflects the evolutionary conservation of domains in proteins by using the domain scores of proteins. Experimental results on the Saccharomyces cerevisiae dataset demonstrate that this approach predicts protein-protein interactions with higher performance than existing approaches.
Background: Acute pulmonary embolism (APE) is a fatal cardiovascular disease, yet missed diagnosis and misdiagnosis often occur because its symptoms and signs are non-specific. A simple, objective technique would help clinicians make a quick and precise diagnosis. In population studies, machine learning (ML) plays a critical role in characterizing cardiovascular risks, predicting outcomes, and identifying biomarkers. This work sought to develop an ML model to aid APE diagnosis and to compare it against current clinical probability assessment models. Methods: This is a single-center retrospective study. Patients with suspected APE were consecutively enrolled and randomly divided into a training set and a testing set. A total of eight ML models, including random forest (RF), Naïve Bayes, decision tree, K-nearest neighbors, logistic regression, multi-layer perceptron, support vector machine, and gradient boosting decision tree, were developed on the training set to diagnose APE. The model with the best diagnostic performance was then selected and evaluated against the current clinical assessment strategies, including the Wells score, the revised Geneva score, and the YEARS algorithm. Finally, the ML model was internally validated to assess its diagnostic performance using receiver operating characteristic (ROC) analysis. Results: The ML models were constructed using eight clinical features: D-dimer, cardiac troponin T (cTNT), arterial oxygen saturation, heart rate, chest pain, lower limb pain, hemoptysis, and chronic heart failure. Among the eight ML models, the RF model achieved the best performance, with the highest area under the curve (AUC = 0.774). Compared to the current clinical assessment strategies, the RF model outperformed the Wells score (P = 0.030) and was not inferior to any other clinical probability assessment strategy. The AUC of the RF model for diagnosing APE onset in the internal validation set was 0.726. Conclusions: A novel prediction model for APE diagnosis was constructed based on the RF algorithm. Compared to the current clinical assessment strategies, the RF model achieved better diagnostic efficacy and accuracy. The ML algorithm can therefore be a useful tool in assisting with the diagnosis of APE.
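As a reading aid for the AUC values reported above, the following minimal Python sketch computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data or code from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 ground-truth values (1 = disease present).
    scores: iterable of model outputs, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: two patients with the condition, two without.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(toy_labels, toy_scores))
```

Under this reading, an AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.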
RNAs play crucial and versatile roles in biological processes. Computational prediction approaches can help to understand RNA structures and their stabilizing factors, thus providing information on their functions and facilitating the design of new RNAs. Machine learning (ML) techniques have made tremendous progress in many fields in the past few years. Although their use in protein-related fields has a long history, the use of ML methods in predicting RNA tertiary structures is new and rare. Here, we review recent advances in using ML methods for RNA structure prediction and discuss the advantages and limitations, as well as the difficulties and potential, of these approaches when applied in the field.
The structure and function of proteins are closely related, and a protein's structure determines its function; protein structure prediction is therefore important. β-turns are important components of protein secondary structure, so an accurate method for predicting β-turn types is needed. In this paper, we used a composite vector combining a position conservation scoring function, increment of diversity, and predicted secondary structure information as the input parameters of a support vector machine algorithm to predict β-turn types in a database of 426 protein chains. We obtained overall prediction accuracies of 95.6%, 97.8%, 97.0%, 98.9%, 99.2%, 91.8%, 99.4% and 83.9%, with Matthews correlation coefficient values of 0.74, 0.68, 0.20, 0.49, 0.23, 0.47, 0.49 and 0.53, for types I, II, VIII, I’, II’, IV, VI and nonturn, respectively, which is better than other prediction methods.
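The results above pair very high accuracies with modest Matthews correlation coefficients (for example 97.0% and 0.20 for type VIII). The sketch below, in Python, shows how that can happen for a rare class using the standard two-class definitions of both metrics. The confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a two-class confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical rare class: 20 true members among 1000 residues,
# of which the classifier recovers only 5 and raises 15 false alarms.
tp, fn, fp, tn = 5, 15, 15, 965
print(f"accuracy = {accuracy(tp, tn, fp, fn):.3f}")
print(f"MCC      = {mcc(tp, tn, fp, fn):.3f}")
```

Because the many true negatives dominate the accuracy, the correlation coefficient is the more informative figure when turn types differ greatly in frequency.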
Successful prediction of protein domain boundaries provides valuable information not only for the computational structure prediction of multi-domain proteins but also for experimental structure determination. A novel method for domain boundary prediction is presented, which combines the support vector machine with the domain guess by size algorithm. Since the evolutionary information of multiple domains can be detected by a position-specific score matrix, the support vector machine is trained and tested using the values of the position-specific score matrix generated by PSI-BLAST. Candidate domain boundaries are selected from the output of the support vector machine and are then passed to the domain guess by size algorithm to give the final domain boundary prediction. The experimental results show that the combined method outperforms both the support vector machine and domain guess by size when each is used alone.
Building on earlier research on predicting β-hairpin motifs in proteins, we apply the Random Forest and Support Vector Machine algorithms to predict β-hairpin motifs in the ArchDB40 dataset. Motifs with a loop length of 2 to 8 amino acid residues are extracted as the research object, and a fixed-length pattern of 12 amino acids is selected. With the same characteristic parameters and the same test method, the Random Forest algorithm is more effective than the Support Vector Machine. In addition, because the Random Forest algorithm does not overfit when the dimension of the characteristic parameters is higher, we use Random Forest with higher-dimensional characteristic parameters to predict β-hairpin motifs. Better prediction results are obtained: the overall accuracy and Matthews correlation coefficient in 5-fold cross-validation reach 83.3% and 0.59, respectively.
Using the latest available artificial intelligence (AI) technology, an advanced algorithm, LIVERFASt™, has been used to evaluate the diagnostic accuracy of machine learning (ML) biomarker algorithms in assessing liver damage. The prevalence of NAFLD (nonalcoholic fatty liver disease) and the resulting NASH (nonalcoholic steatohepatitis) is constantly increasing worldwide, creating challenges for screening because the diagnosis of NASH requires an invasive liver biopsy. Key issues in NAFLD patients are the differentiation of NASH from simple steatosis and the identification of advanced hepatic fibrosis. In this prospective study, the staging of three different lesions of the liver to diagnose fatty liver was analyzed using the proprietary ML algorithm LIVERFASt™, developed with a database of 2862 unique medical assessments of biomarkers, of which 1027 assessments were used to train the algorithm and 1835 constituted the validation set. Data from 13,068 patients who underwent the LIVERFASt™ test for evaluation of fatty liver disease were analyzed. The evaluation revealed that 11% of the patients exhibited significant fibrosis, with fibrosis scores of 0.6 to 1.00. Approximately 7% of the population had severe hepatic inflammation. Steatosis was observed in most patients (63%), whereas severe steatosis (S3) was observed in 20%. Using modified SAF (Steatosis, Activity and Fibrosis) scores obtained with the LIVERFASt™ algorithm, NAFLD was detected in 13.41% of the patients (Sx > 0, Ay 0). Approximately 1.91% of the patients (Sx > 0, Ay = 2, Fz > 0) showed NAFLD or NASH scorings, while 1.08% had confirmed NASH (Sx > 0, Ay > 2, Fz = 1 - 2) and 1.49% had advanced NASH (Sx > 0, Ay > 2, Fz = 3 - 4). The modified SAF scoring system generated by LIVERFASt™ provides a simple and convenient evaluation of NAFLD and NASH in a cohort of Southeast Asians. This system may lead to the use of noninvasive liver tests in extended populations for more accurate diagnosis of liver pathology, prediction of the clinical path of individuals at all stages of liver disease, and provision of an efficient system for therapeutic interventions.
According to the Food and Agriculture Organization of the United Nations (FAO), there are about 500 million smallholder farmers in the world, and in developing countries such farmers produce about 80% of the food consumed there; their farming activities are therefore critical to the economies of their countries and to global food security. However, these farmers have limited access to credit, often because many of them farm on unregistered land that cannot be offered as collateral to lending institutions. Even when they are on registered land, the fear of losing that land should they default on loan payments often prevents them from applying for farm credit, and if they do apply, they are still disadvantaged by low credit scores (a measure of creditworthiness). As a result, they are often unable to use optimal farm inputs such as fertilizer and good seeds. This depresses their yields and, in turn, has negative implications for food security in their communities and in the world, making it difficult for the UN to achieve its Sustainable Development Goal 2 (zero hunger). This study aimed to demonstrate how geospatial technology can be used to leverage farm credit scoring for the benefit of smallholder farmers. A survey was conducted within the study area to identify the smallholder farms and farmers. A sample of the surveyed farmers was then subjected to credit scoring by machine learning. In the first instance, the traditional financial data approach was used, and the results showed that over 40% of the farmers could not qualify for credit. When non-financial geospatial data, namely the Normalized Difference Vegetation Index (NDVI), was introduced into the scoring model, the share of farmers not qualifying for credit fell significantly, to 24%. It is concluded that introducing the NDVI variable into the traditional scoring model could significantly improve smallholder farmers' chances of accessing credit, enabling a farmer to be evaluated for credit on the basis of the health of their crop rather than on a traditional form of collateral.
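For readers unfamiliar with the index, NDVI is conventionally computed per pixel from near-infrared (NIR) and red reflectance as (NIR - Red) / (NIR + Red), giving values between -1 and 1, with dense healthy vegetation towards the upper end. The Python sketch below shows that formula and a simple per-farm average. How the study actually derived its NDVI feature is not stated in the abstract, so the averaging step, the function names, and the reflectance values are assumptions made for illustration only.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel.

    nir, red: surface reflectance in the near-infrared and red bands.
    Returns 0.0 when both bands are zero to avoid dividing by zero.
    """
    total = nir + red
    if total == 0:
        return 0.0
    return (nir - red) / total


def mean_ndvi(pixels):
    """Average NDVI over an iterable of (nir, red) pairs, e.g. one farm plot."""
    values = [ndvi(nir, red) for nir, red in pixels]
    if not values:
        raise ValueError("no pixels supplied")
    return sum(values) / len(values)


# Hypothetical reflectance pairs for the pixels covering one farm plot.
farm_pixels = [(0.50, 0.10), (0.45, 0.15), (0.40, 0.20)]
print(round(mean_ndvi(farm_pixels), 3))
```

A plot-level summary of this kind is one plausible way a crop-health signal could be added as an extra column alongside financial variables in a scoring model.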
Accurate prediction of protein-ligand complex structures is a crucial step in structure-based drug design. Traditional molecular docking methods are limited in accuracy and sampling space, while relying on machine learning approaches alone may lead to invalid conformations. In this study, we propose a novel strategy that combines molecular docking and machine learning methods. First, the protein-ligand binding poses are predicted using a deep learning model. Next, position-restricted docking on the predicted binding poses is performed using Uni-Dock, generating physically constrained and valid binding poses. Finally, the binding poses are re-scored and ranked using machine learning scoring functions. This strategy harnesses both the predictive power of machine learning and the physical constraints of molecular docking. Evaluation experiments on multiple datasets demonstrate that, compared to using molecular docking or machine learning methods alone, the proposed strategy can significantly improve the success rate and accuracy of protein-ligand complex structure predictions.
Funding (domain score-based SVM study of protein-protein interactions): supported by the National Natural Science Foundation of China (Grant No. 30571059), the National High-Technology Research and Development Program of China (Grant No. 2006AA02Z190), and the Shanghai Leading Academic Discipline Project (Grant No. S30405).
Funding (acute pulmonary embolism ML study): supported by grants from the Chinese Academy of Medical Sciences Innovation Fund for Medical Sciences (No. 2021-I2M-1-049), the Elite Medical Professionals Project of China-Japan Friendship Hospital (No. ZRJY2021-BJ02), and the National High Level Hospital Clinical Research Funding (No. 2022-NHLHCRF-LX-01).
Funding (review of ML methods for RNA structure prediction): supported by the National Natural Science Foundation of China (Grant Nos. 11774158, 11974173, 11774157, and 11934008).
Funding (protein domain boundary prediction study): supported by the National Natural Science Foundation of China (No. 60435020).
Funding (protein-ligand complex structure prediction study): supported by the National Key Research and Development Program of China (2022YFA1004302).