Funding: Supported by the National High-Tech Research and Development Program (863 Program) of China (No. 2002AA231071) and the Natural Science Foundation of Jiangsu Province (No. BK2002057).
Abstract: Traditional sequence analysis depends on sequence alignment. In this study, we analyzed various functional regions of the human genome based on sequence features, including word frequency, dinucleotide relative abundance, and base-base correlation. We analyzed human chromosome 22 and classified its upstream, exon, intron, downstream, and intergenic regions by principal component analysis and discriminant analysis of these features. The results show that the functional regions of the genome can be classified using sequence features and discriminant analysis.
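Two of the features named in this abstract, word frequency and dinucleotide relative abundance, can be illustrated with a minimal sketch. The function names and the toy sequence are assumptions made for illustration, not taken from the study. The relative abundance is computed here as rho_xy = f_xy / (f_x * f_y) from single-strand frequencies; the study may have used a strand-symmetrized variant.

```python
from collections import Counter
from itertools import product


def word_frequencies(seq, k):
    """Relative frequency of every overlapping k-letter word in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def dinucleotide_relative_abundance(seq):
    """rho_xy = f_xy / (f_x * f_y) for each of the 16 dinucleotides."""
    mono = word_frequencies(seq, 1)
    di = word_frequencies(seq, 2)
    rho = {}
    for x, y in product("ACGT", repeat=2):
        expected = mono.get(x, 0.0) * mono.get(y, 0.0)
        # rho is left undefined (None) when either base is absent.
        rho[x + y] = di.get(x + y, 0.0) / expected if expected > 0 else None
    return rho


toy_sequence = "ACGTACGTTTGACA"  # made-up sequence, for illustration only
dinucleotide_freqs = word_frequencies(toy_sequence, 2)
rho = dinucleotide_relative_abundance(toy_sequence)
```

A value of rho near 1 means the dinucleotide occurs about as often as its base composition predicts; values well above or below 1 indicate over- or under-representation.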
Funding: Supported by the China Agricultural Research System (CARS-36).
Abstract: To study the Min pig Y-box binding protein (YB-1) gene, its complete coding sequence (CDS) was cloned by RT-PCR, and the sequence features were analyzed with software and online tools. The complete CDS of Min pig YB-1 was 975 bp long and encoded 324 amino acids. The protein contained a conserved cold shock domain and several phosphorylation sites but no transmembrane domains, consistent with a cytoplasmic protein. The Min pig YB-1 nucleotide sequence shared high similarity (61.37%-97.66%) with those of other mammals.
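The reported lengths are mutually consistent if the CDS ends with a stop codon, which is the usual convention and an assumption here. A small check (the function name is hypothetical):

```python
def protein_length_from_cds(cds_length_bp, includes_stop_codon=True):
    """Number of encoded amino acids for a CDS of the given length."""
    if cds_length_bp % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    codons = cds_length_bp // 3
    # A terminal stop codon occupies one codon but encodes no amino acid.
    return codons - 1 if includes_stop_codon else codons


print(protein_length_from_cds(975))  # 325 codons minus the stop codon -> 324
```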
Funding: Supported by the China Postdoctoral Science Foundation (2014T70722) and the Humanities and Social Science Foundation of the Ministry of Education of China (16YJCZH004).
Abstract: Structure features require complicated pre-processing and are probably domain-dependent. To reduce the time cost of pre-processing, we propose a novel neural network architecture, a bi-directional long short-term memory recurrent neural network (Bi-LSTM-RNN) model based on low-cost sequence features such as words and part-of-speech (POS) tags, to classify the relation between two entities. First, the model performs bi-directional recurrent computation along the tokens of a sentence. Then, the sequence is divided into five parts and standard pooling functions are applied over the token representations of each part. Finally, the resulting representations are concatenated and fed into a softmax layer for relation classification. We evaluate our model on two standard benchmark datasets in different domains, namely SemEval-2010 Task 8 and BioNLP-ST 2016 Task BB3. On SemEval-2010 Task 8, our model matches the performance of state-of-the-art models, achieving an F1 of 83.0%. On BioNLP-ST 2016 Task BB3, our model obtains an F1 of 51.3%, which is comparable with that of the best system. Moreover, we find that the context between the two target entities plays an important role in relation classification and can replace the shortest dependency path.
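The five-part pooling step can be sketched in plain Python. The abstract does not say how the five parts are chosen; this sketch assumes they are the text before the first entity, the first entity, the context between the entities, the second entity, and the text after it. Plain lists of floats stand in for the Bi-LSTM token representations, max pooling stands in for the "standard pooling functions", and all names are hypothetical.

```python
def max_pool(vectors, dim):
    """Element-wise max over a list of vectors; zeros for an empty segment."""
    if not vectors:
        return [0.0] * dim
    return [max(v[j] for v in vectors) for j in range(dim)]


def five_part_pooling(token_vecs, e1_span, e2_span):
    """Pool each of five segments and concatenate the pooled vectors.

    e1_span and e2_span are (start, end) token indices, end exclusive,
    with the first entity located before the second (assumed layout).
    """
    dim = len(token_vecs[0])
    (s1, t1), (s2, t2) = e1_span, e2_span
    segments = [
        token_vecs[:s1],    # before the first entity
        token_vecs[s1:t1],  # first entity
        token_vecs[t1:s2],  # context between the entities
        token_vecs[s2:t2],  # second entity
        token_vecs[t2:],    # after the second entity
    ]
    pooled = []
    for segment in segments:
        pooled.extend(max_pool(segment, dim))
    return pooled


# Six made-up 2-dimensional token vectors; entities at tokens 1 and 4.
toy_vecs = [[0.1, 0.5], [0.9, 0.2], [0.3, 0.3], [0.4, 0.8], [0.7, 0.1], [0.2, 0.6]]
features = five_part_pooling(toy_vecs, (1, 2), (4, 5))
```

The concatenated vector has a fixed length of five times the token dimension regardless of sentence length, which is what lets it feed a softmax classifier.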
Funding: Supported by the National Natural Science Foundation of China (31000591, 31000587, 31171266).
Abstract: The discovery of novel cancer genes is one of the main goals in cancer research. Bioinformatics methods can be used to accelerate cancer gene discovery, which may help in the understanding of cancer and the development of drug targets. In this paper, we describe a classifier for predicting potential cancer genes that we developed by integrating multiple lines of biological evidence, including protein-protein interaction network properties and sequence and functional features. We detected 55 features that differed significantly between cancer genes and non-cancer genes. Fourteen cancer-associated features were chosen to train the classifier. Four machine learning methods, logistic regression, support vector machines (SVMs), BayesNet, and decision tree, were explored in the classifier models to distinguish cancer genes from non-cancer genes. The predictive power of the different models was evaluated by 5-fold cross-validation. The area under the receiver operating characteristic curve for the logistic regression, SVM, BayesNet, and J48 tree models was 0.834, 0.740, 0.800, and 0.782, respectively. Finally, the logistic regression classifier with multiple biological features was applied to the genes in the Entrez database, and 1976 cancer gene candidates were identified. We found that the integrated prediction model performed much better than models based on individual types of biological evidence, and that the network and functional features had stronger predictive power than the sequence features.
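The evaluation described here, 5-fold cross-validation scored by the area under the ROC curve, can be sketched without any library. This is a generic illustration, not the authors' pipeline. The AUC is computed as the fraction of positive/negative pairs in which the positive example receives the higher score, and the folds are simple contiguous splits; function names and the toy labels are assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


# Made-up labels (1 = cancer gene) and classifier scores.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
folds = list(k_fold_indices(12, k=5))
```

In practice the examples would be shuffled (and usually stratified by class) before splitting, and the AUC would be computed on each held-out fold and then averaged.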
Funding: The authors thank the anonymous referees for their useful comments, which greatly improved the quality of the paper. This work was supported in part by the National Basic Research Program 973 of China (2012CB316203), the Natural Science Foundation of China (Grant Nos. 61033007, 61272121, 61332014, 61572367, 61332006, 61472321, and 61502390), the National High Technology Research and Development Program 863 of China (2015AA015307), the Fundamental Research Funds for the Central Universities (3102015JSJ0011, 3102014JSJ0005, and 3102014JSJ0013), and the Graduate Starting Seed Fund of Northwestern Polytechnical University (Z2012128).
Abstract: The order-preserving submatrix (OPSM) has become important in modelling biologically meaningful subspace clusters, capturing the general tendency of gene expression across a subset of conditions. With advances in microarray and analysis techniques, large volumes of gene expression datasets and OPSM mining results are being produced. An OPSM query can efficiently retrieve relevant OPSMs from these large collections. However, improving the relevance of OPSM queries remains difficult in real-life exploratory data analysis. First, it is hard to capture subjective aspects of interestingness, e.g., the analyst's expectations given her or his domain knowledge. Second, even when these expectations can be specified declaratively, it is still challenging to use them during the computation of OPSM queries. To the best of our knowledge, existing methods mainly focus on batch OPSM mining, while few address OPSM queries. To solve these problems, the paper proposes two constrained OPSM query methods, which exploit user-defined constraints to search for relevant results using two kinds of indices. Extensive experiments on real datasets demonstrate that queries based on the multi-dimension index (cIndex) and the enumerating sequence index (esIndex) perform better than brute-force search.
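The order-preserving property itself is easy to state in code: a set of rows and columns forms an OPSM when a single ordering of the columns makes every selected row strictly increasing. The check below is a minimal sketch of that definition only; it does not implement the cIndex or esIndex query methods proposed in the paper, and the matrix values are made up.

```python
def is_order_preserving(matrix, rows, cols):
    """True if one ordering of cols makes every selected row strictly increasing."""
    if not rows:
        return True
    first = matrix[rows[0]]
    # If a shared ordering exists, it must be the one induced by the first row.
    order = sorted(cols, key=lambda c: first[c])
    return all(
        all(matrix[r][a] < matrix[r][b] for a, b in zip(order, order[1:]))
        for r in rows
    )


# Rows are genes, columns are conditions (toy expression values).
expr = [
    [1.0, 5.0, 3.0, 9.0],
    [2.0, 8.0, 4.0, 1.0],
    [0.5, 7.0, 6.0, 2.0],
]
print(is_order_preserving(expr, [0, 1, 2], [0, 1, 2]))     # True: order 0 < 2 < 1
print(is_order_preserving(expr, [0, 1, 2], [0, 1, 2, 3]))  # False: column 3 breaks it
```

A brute-force query would run such a check against every stored OPSM, which is the baseline the indexed methods are compared with.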
Abstract: Protein trafficking, or protein sorting, in eukaryotes is a complicated process that is carried out based on the information contained in the protein. Many methods have been reported for predicting the subcellular location of proteins from sequence information. However, most of these prediction methods use a flat structure or parallel architecture. In this work, we introduce ensemble classifiers with features extracted directly from full-length protein sequences to predict locations in the protein-sorting pathway hierarchically. Sequence-driven features, sequence-mapped features, and sequence autocorrelation features were tested with ensemble learners and their performances were compared. When evaluated on independent test data, ensemble-based bagging algorithms with the composition, transition, and distribution (CTD) sequence feature successfully classified two datasets with accuracies greater than 90%. We compared our results with similar published methods, and our method performed on par with the others at two levels in the secretory pathway. This study shows that the CTD feature extracted from protein sequences is effective in capturing biological features among compartments in secretory pathways.
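The composition and transition parts of the CTD feature can be sketched as follows; the distribution part is omitted for brevity. The three-class hydrophobicity grouping used here is one commonly used choice and is an assumption, since the abstract does not state which amino acid properties were used. Function names and the toy peptide are hypothetical.

```python
# Assumed three-class hydrophobicity grouping (polar, neutral, hydrophobic).
GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
AA_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def encode(seq):
    """Replace each standard amino acid by its group label; skip other symbols."""
    return "".join(AA_TO_GROUP[aa] for aa in seq.upper() if aa in AA_TO_GROUP)


def composition(encoded):
    """Fraction of residues in each group (encoded must be non-empty)."""
    n = len(encoded)
    return {g: encoded.count(g) / n for g in GROUPS}


def transition(encoded):
    """Fraction of adjacent pairs that switch between two given groups."""
    pairs = list(zip(encoded, encoded[1:]))  # needs at least two residues
    result = {}
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        changes = sum(1 for x, y in pairs if {x, y} == {a, b})
        result[a + b] = changes / len(pairs)
    return result


toy_peptide = "MKTAYIAKQR"  # made-up sequence, for illustration only
encoded = encode(toy_peptide)
comp = composition(encoded)
trans = transition(encoded)
```

Repeating this for several physicochemical properties and concatenating the results gives a fixed-length vector that a bagging ensemble can take as input.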
Funding: This work was supported by the National Key R&D Program of China (2018YFC0910405), the National Natural Science Foundation of China (NSFC, Grant No. 61872268), and the Open Project Funding of the CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences (CASNDST201705).
Abstract: The chloroplast is a subcellular organelle found in green plants and algae and is the main organelle for photosynthesis. Proteins that localize within the chloroplast are responsible for the photosynthetic process at the molecular level. The chloroplast can be further divided into several compartments, and proteins in different compartments are involved in different steps of photosynthesis. Since the molecular function of a protein is highly correlated with its exact cellular localization, pinpointing the subchloroplast location of a chloroplast protein is an important step towards understanding its role in the photosynthetic process. Experimental determination of protein subchloroplast location is costly and time-consuming. Therefore, computational approaches have been developed to predict protein subchloroplast locations from primary sequences. Over the last decades, more than a dozen studies have tried to predict protein subchloroplast locations with machine learning methods, introducing various sequence features and machine learning algorithms. In this review, we collect comprehensive information on all existing studies regarding the prediction of protein subchloroplast locations. We compare these studies in terms of benchmarking datasets, sequence features, machine learning algorithms, predictive performance, and implementation availability. We summarize the progress and current status of this research topic and try to identify the most likely directions for future work. We hope this review not only lists all existing works but also serves readers as a useful resource for quickly grasping the big picture of this research topic, and that it can be a starting point for future methodological studies on the prediction of protein subchloroplast locations.
Abstract: Mitochondrial disease is a clinically and genetically heterogeneous group of diseases, which makes diagnosis very difficult for clinicians. Our objective was to analyze the clinical and genetic characteristics of children with mitochondrial disease in China. We tested 141 candidate patients suspected of mitochondrial disorders using targeted next-generation sequencing (NGS) and summarized the clinical and genetic data of the gene-confirmed cases from the Neurology Department, Beijing Children's Hospital, Capital Medical University, from October 2012 to January 2015. The 40 gene-confirmed cases comprised eight kinds of mitochondrial disease, among which Leigh syndrome was the most common type, followed by mitochondrial encephalomyopathy, lactic acidosis, and stroke-like episodes (MELAS). The age of onset varied among mitochondrial diseases, but early onset was common. Of the 40 gene-confirmed cases, 25 (62.5%) had mitochondrial DNA (mtDNA) mutations and 15 (37.5%) had nuclear DNA (nDNA) mutations. m.3243A>G (n=7) accounted for a large proportion of the mtDNA mutations. The nDNA mutations involved SURF1 (n=7), PDHA1 (n=2), and NDUFV1, NDUFAF6, SUCLA2, SUCLG1, RRM2B, and C12orf65 (one case each).