Funding: Supported by the Natural Science Foundation of Jiangsu Province (No. BK2004142) and partly by the National Natural Science Foundation of China (No. 60275007).
Abstract: G-protein coupled receptors (GPCRs) are a class of seven-helix transmembrane proteins that are used in bioinformatics as targets to facilitate drug discovery for human diseases. Although thousands of GPCR sequences have been collected, the ligand specificity of many GPCRs is still unknown, and only one crystal structure of the rhodopsin-like family has been solved. Identifying GPCR types from sequence data alone has therefore become an important research problem. This study introduces a novel technique for identifying GPCR types that combines the weighted Levenshtein distance between two receptor sequences with the nearest neighbor method (NNM), and that can handle receptor sequences of different lengths directly. In experiments classifying four classes (acetylcholine, adrenoceptor, dopamine, and serotonin) of the rhodopsin-like family of GPCRs, the error rates from the leave-one-out and leave-half-out procedures were 0.62% and 1.24%, respectively. These results are superior to those of the covariant discriminant algorithm, the support vector machine method, and the NNM with Euclidean distance.
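The combination described in this abstract (a weighted Levenshtein distance fed into a nearest neighbor classifier, scored by leave-one-out) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the weighting scheme, so the insertion, deletion, and substitution costs below default to 1 and are assumptions, as are the function names and the toy sequences.

```python
from typing import Callable, List, Tuple


def unit_substitution(x: str, y: str) -> float:
    """Assumed substitution cost: 0 for a match, 1 otherwise."""
    return 0.0 if x == y else 1.0


def weighted_levenshtein(
    a: str,
    b: str,
    ins_cost: float = 1.0,
    del_cost: float = 1.0,
    sub_cost: Callable[[str, str], float] = unit_substitution,
) -> float:
    """Edit distance between a and b with configurable operation costs.

    Works on sequences of different lengths, so no alignment to a fixed
    length is needed beforehand.
    """
    # prev[j] = cost of turning the processed prefix of a into b[:j]
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * del_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + del_cost,            # delete ca
                    curr[j - 1] + ins_cost,        # insert cb
                    prev[j - 1] + sub_cost(ca, cb),  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def nearest_neighbor_label(query: str, training: List[Tuple[str, str]]) -> str:
    """Return the label of the training sequence closest to the query."""
    _, label = min(training, key=lambda item: weighted_levenshtein(query, item[0]))
    return label


def leave_one_out_error(data: List[Tuple[str, str]]) -> float:
    """Fraction of sequences misclassified when each is held out in turn."""
    errors = 0
    for i, (seq, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if nearest_neighbor_label(seq, rest) != label:
            errors += 1
    return errors / len(data)
```

A substitution matrix that penalizes chemically dissimilar residues more heavily could be passed as `sub_cost`. Whether the paper weights its distance this way is not stated in the abstract.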
Disclosure statement: No potential conflict of interest was reported by the authors.
Abstract: Purpose: Adequate data for learning and training is an important constraint in developing an efficient classifier with outstanding performance. Data usually follows a biased distribution, with classes unequally represented within a dataset. This issue is known as the imbalance problem, one of the most common issues occurring in real-time applications. Learning from imbalanced datasets is a ubiquitous challenge in the field of data mining, because imbalanced data degrades classifier performance and produces inaccurate results. Design/methodology/approach: A novel fuzzy-based Gaussian synthetic minority oversampling (FG-SMOTE) algorithm is proposed to process imbalanced data. The Gaussian SMOTE technique relies on the nearest neighbour concept to balance the ratio between the minority and majority classes, and this ratio is balanced using a fuzzy-based Levenshtein distance measure. Findings: The performance and accuracy of the proposed algorithm were evaluated using a deep belief network classifier. The fuzzy-based Gaussian SMOTE technique achieved an AUC of 93.7%, an F1 score of 94.2%, and a geometric mean score of 93.6%, computed from the confusion matrix. Research limitations/implications: Some challenges remain to be addressed, such as applying FG-SMOTE to multiclass imbalanced datasets and evaluating the imbalance problem in a distributed environment. Originality/value: The proposed algorithm addresses the issues and challenges involved in handling imbalanced data. FG-SMOTE has aided in balancing minority and majority class datasets.
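For orientation, the nearest-neighbour interpolation that SMOTE-style oversampling builds on can be sketched as below. This shows plain SMOTE only, with Euclidean distance and a uniformly drawn interpolation gap. The Gaussian and fuzzy Levenshtein components of FG-SMOTE are not specified in the abstract and are not reproduced here. The function name, parameters, and toy points are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def smote(
    minority: Sequence[Point],
    n_synthetic: int,
    k: int = 2,
    seed: int = 0,
) -> List[Point]:
    """Generate synthetic minority samples by interpolating towards neighbours.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic: List[Point] = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        x = minority[i]
        # All other minority samples, nearest first.
        others = list(minority[:i]) + list(minority[i + 1:])
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # uniform in [0, 1); plain SMOTE, not the Gaussian variant
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nn)))
    return synthetic
```

Calling `smote(minority, n_synthetic=len(majority) - len(minority))` would bring the two classes to equal size, which is the balancing goal the abstract describes.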
Funding: Supported by the National Natural Science Foundation of China (61133012, 61202193, 61373108), the Major Projects of the National Social Science Foundation of China (11&ZD189), the Chinese Postdoctoral Science Foundation (2013M540593, 2014T70722), and the Open Foundation of Shandong Key Laboratory of Language Resource Development and Application.
Abstract: In this paper we propose a multiple feature approach for the normalization task, which maps each disorder mention in the text to a unique Unified Medical Language System (UMLS) concept unique identifier (CUI). We develop a two-step method: first, acquire a list of candidate CUIs and their associated preferred names using the UMLS API; second, choose the closest CUI by calculating the similarity between the input disorder mention and each candidate. The similarity calculation step is formulated as a classification problem, and multiple features (string features, ranking features, similarity features, and contextual features) are used to normalize the disorder mentions. The results show that the multiple feature approach improves the accuracy of the normalization task from 32.99% to 67.08% compared with the MetaMap baseline.
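The second step of the method (choosing the closest candidate) can be illustrated with a single string-similarity score. This is a simplified stand-in: the paper combines several feature types in a classifier, whereas the sketch below ranks candidates by one character-level ratio from Python's standard `difflib`. The candidate list, the placeholder identifiers, and the function names are hypothetical, and no UMLS API call is made.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def choose_cui(mention: str, candidates: List[Tuple[str, str]]) -> str:
    """Return the identifier whose preferred name is most similar to the mention.

    candidates holds (identifier, preferred_name) pairs, as would be
    returned by the candidate-acquisition step.
    """
    best_id, _ = max(candidates, key=lambda c: similarity(mention, c[1]))
    return best_id


# Hypothetical candidates; the identifiers are placeholders, not real CUIs.
candidates = [
    ("CUI-A", "heart attack"),
    ("CUI-B", "headache"),
    ("CUI-C", "hearing loss"),
]
print(choose_cui("heart attak", candidates))
```

In a multiple-feature setting, this ratio would be one input among several (ranking position, contextual overlap, and so on) to a classifier that scores each mention-candidate pair.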