In this study, our aim is to address the problem of gene selection by proposing a hybrid bio-inspired evolutionary algorithm that combines Grey Wolf Optimization (GWO) with Harris Hawks Optimization (HHO) for feature selection. The motivation for utilizing GWO and HHO stems from their bio-inspired nature and their demonstrated success in optimization problems. We aim to leverage the strengths of these algorithms to enhance the effectiveness of feature selection in microarray-based cancer classification. We selected leave-one-out cross-validation (LOOCV) to evaluate the performance of two widely used classifiers, k-nearest neighbors (KNN) and support vector machine (SVM), on high-dimensional cancer microarray data. The proposed method is extensively tested on six publicly available cancer microarray datasets, and a comprehensive comparison with recently published methods is conducted. Our hybrid algorithm demonstrates its effectiveness in improving classification performance, surpassing alternative approaches in terms of precision. The outcomes confirm the capability of our method to substantially improve both the precision and efficiency of cancer classification, thereby advancing the development of more efficient treatment strategies. The proposed hybrid method offers a promising solution to the gene selection problem in microarray-based cancer classification. It improves the accuracy and efficiency of cancer diagnosis and treatment, and its superior performance compared to other methods highlights its potential applicability in real-world cancer classification tasks. By harnessing the complementary search mechanisms of GWO and HHO, we leverage their bio-inspired behavior to identify informative genes relevant to cancer diagnosis and treatment.
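The evaluation step mentioned in this abstract, scoring a candidate gene subset by the LOOCV accuracy of a KNN classifier, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `knn_predict` and `loocv_accuracy`, the Euclidean distance, the default `k=1`, and the toy data are all hypothetical, and the GWO/HHO search that would propose the subsets is not shown.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=1):
    """Predict the label of x by majority vote among its k nearest neighbours."""
    # Sort training samples by Euclidean distance to x.
    dists = sorted((math.dist(row, x), label) for row, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def loocv_accuracy(X, y, selected, k=1):
    """Leave-one-out accuracy of KNN using only the gene columns in `selected`."""
    # Restrict every sample to the selected gene indices.
    Xs = [[row[j] for j in selected] for row in X]
    correct = 0
    for i in range(len(Xs)):
        # Hold out sample i, train on all the others.
        train_X = Xs[:i] + Xs[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_X, train_y, Xs[i], k) == y[i]:
            correct += 1
    return correct / len(Xs)


# Toy data (assumed): gene 0 separates the classes, gene 1 is misleading.
X = [[0.1, 5.0], [0.2, 1.0], [0.9, 4.8], [1.0, 1.2]]
y = ["a", "a", "b", "b"]
print(loocv_accuracy(X, y, selected=[0]))  # informative gene only
print(loocv_accuracy(X, y, selected=[1]))  # misleading gene only
```

In a wrapper-style search, a value such as `loocv_accuracy(X, y, subset)` would typically serve as (part of) the fitness of each candidate subset; how the paper combines accuracy with subset size is not stated in the abstract.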
Acute leukemia is an aggressive disease that has high mortality rates worldwide. The error rate can be as high as 40% when classifying acute leukemia into its subtypes. So, there is an urgent need to support hematologists during the classification process. More than two decades ago, researchers used microarray gene expression data to classify cancer and adopted acute leukemia as a test case. The high classification accuracy they achieved confirmed that it is possible to classify cancer subtypes using microarray gene expression data. Ensemble machine learning is an effective method that combines individual classifiers to classify new samples. Ensemble classifiers are recognized as powerful algorithms with numerous advantages over traditional classifiers. Over the past few decades, researchers have focused a great deal of attention on ensemble classifiers in a wide variety of fields, including but not limited to disease diagnosis, finance, bioinformatics, healthcare, manufacturing, and geography. This paper reviews the recent ensemble classifier approaches utilized for acute leukemia gene expression data classification. Moreover, a framework for classifying acute leukemia gene expression data is proposed. The pairwise correlation gene selection method and the Rotation Forest of Bayesian Networks are both used in this framework. Experimental outcomes show that the classification accuracy achieved by the acute leukemia ensemble classifiers constructed according to the suggested framework compares well with the classification accuracy achieved in other studies.
In bioinformatics applications, the examination of microarray data has received significant interest for diagnosing diseases. Microarray gene expression data is characterized by a massive search space, which poses a primary challenge in the appropriate selection of genes. Microarray data classification incorporates multiple disciplines such as bioinformatics, machine learning (ML), data science, and pattern classification. This paper designs an optimal deep neural network based microarray gene expression classification (ODNN-MGEC) model for bioinformatics applications. The proposed ODNN-MGEC technique performs a data normalization process to bring the data onto a uniform scale. Besides, an improved fruit fly optimization (IFFO) based feature selection technique is used to reduce the high dimensionality of the biomedical data. Moreover, a deep neural network (DNN) model is applied for the classification of microarray gene expression data, and the hyperparameter tuning of the DNN model is carried out using the Symbiotic Organisms Search (SOS) algorithm. The utilization of the IFFO and SOS algorithms paves the way for accomplishing maximum gene expression classification outcomes. To examine the improved outcomes of the ODNN-MGEC technique, a wide-ranging experimental analysis is made against benchmark datasets. The extensive comparison study with recent approaches demonstrates the enhanced outcomes of the ODNN-MGEC technique in terms of different measures.
Gene expression (GE) classification is a research trend as it has been used for the diagnosis and prognosis of many diseases. Employing machine learning (ML) in the prediction of many diseases based on GE data has been a flourishing research area. However, some diseases, like Alzheimer’s disease (AD), have not received considerable attention, probably owing to data scarcity obstacles. In this work, we shed light on the accurate prediction of AD from GE data using ML. Our approach consists of four phases: preprocessing, gene selection (GS), classification, and performance validation. In the preprocessing phase, gene columns are preprocessed identically. In the GS phase, a hybrid of a filtering method and an embedded method is used. In the classification phase, three ML models are implemented using the bare minimum of the chosen genes obtained from the previous phase. The final phase is to validate the performance of these classifiers using different metrics. The crux of this article is to select the most informative genes from the hybrid method, and the best ML technique to predict AD using this minimal set of genes. Five different datasets are used to achieve our goal. We predict AD with impressive values for the MultiLayer Perceptron (MLP) classifier, which has the best performance metrics in four datasets, while the Support Vector Machine (SVM) achieves the highest performance values in only one dataset. We assessed the classifiers using seven metrics and received impressive results, allowing for a credible performance rating. The metric values we obtain in our study lie in the range [.97, .99] for the accuracy (Acc), [.97, .99] for F1-score, [.94, .98] for kappa index, [.97, .99] for area under curve (AUC), [.95, 1] for precision, [.98, .99] for sensitivity (recall), and [.98, 1] for specificity. With these results, the proposed approach outperforms recent interesting results.
The current study proposes a novel technique for feature selection by inculcating robustness in the conventional signal-to-noise ratio (SNR). The proposed method utilizes a robust measure of location, i.e., the median, as well as robust measures of variation, i.e., the median absolute deviation (MAD) and the interquartile range (IQR), in the SNR. In this way, two independent robust signal-to-noise ratios have been proposed. The proposed method selects the most informative genes/features by combining the minimum subset of genes or features obtained via the greedy search approach with top-ranked genes selected through the robust signal-to-noise ratio (RSNR). The results obtained via the proposed method are compared with well-known gene/feature selection methods on the basis of a performance metric, i.e., classification error rate. A total of 5 gene expression datasets have been used in this study. Different subsets of informative genes are selected by the proposed method and all the other methods included in the study, and their efficacy in terms of classification is investigated using classifier models such as support vector machine (SVM), random forest (RF), and k-nearest neighbors (k-NN). The results of the analysis reveal that the proposed method (RSNR) produces lower error rates than all the other competing feature selection methods in the majority of cases. For further assessment of the method, a detailed simulation study is also conducted.
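To make the idea of a robust signal-to-noise ratio concrete, the following sketch replaces the class means and standard deviations of the conventional SNR, commonly written as (mean_1 - mean_2) / (sd_1 + sd_2), with medians and MADs. The exact form used in the paper is not given in the abstract, so the formula below, the absolute value, the function names `mad`, `robust_snr`, and `rank_genes`, and the handling of zero spread are all assumptions made for illustration. An IQR-based variant would follow the same pattern with a different spread measure.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |v - median(values)|."""
    m = median(values)
    return median([abs(v - m) for v in values])


def robust_snr(class_a, class_b):
    """Assumed robust SNR: |median_a - median_b| / (MAD_a + MAD_b)."""
    gap = abs(median(class_a) - median(class_b))
    spread = mad(class_a) + mad(class_b)
    if spread == 0:
        # Degenerate case (assumption): perfectly tight classes.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def rank_genes(X, y, label_a, label_b):
    """Return gene indices sorted from highest to lowest robust SNR."""
    n_genes = len(X[0])
    scores = []
    for j in range(n_genes):
        a = [row[j] for row, lab in zip(X, y) if lab == label_a]
        b = [row[j] for row, lab in zip(X, y) if lab == label_b]
        scores.append((robust_snr(a, b), j))
    return [j for _, j in sorted(scores, key=lambda t: -t[0])]


# One gene with an outlier (100) in the first class: the median and MAD
# are barely affected, whereas a mean-based SNR would be distorted.
print(robust_snr([1, 2, 3, 4, 100], [10, 11, 12, 13, 14]))  # 4.5
```

The outlier example shows the design motivation: the medians are 3 and 12 and both MADs are 1, so the score is 9 / 2 regardless of how extreme the outlying value is.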
Background: Recently, researchers have become interested in identifying the crucial genes related to cancer, which play an important role in cancer diagnosis and treatment. However, in performing the cancer molecular subtype classification task from cancer gene expression data, it is challenging to obtain those significant genes due to the high dimensionality and high noise of the data. Moreover, the existing methods often suffer from issues such as premature convergence. Methods: To address those problems, we propose a new ant colony optimization (ACO) algorithm called DACO to classify the cancer gene expression datasets, identifying the essential genes of different diseases. In DACO, first, we set the initial pheromone concentration based on the weight ranking vector to accelerate convergence; then, a dynamic pheromone volatility factor is designed to prevent the algorithm from getting stuck in a local optimum; finally, the pheromone update rule of the Ant Colony System is employed to update the pheromone globally and locally. To demonstrate the performance of the proposed algorithm in classification, different existing approaches are compared with the proposed algorithm on eight high-dimensional cancer gene expression datasets. Results: The experimental results show that the proposed algorithm performs better than other effective methods in terms of classification accuracy and the number of selected features. It can be used to address the classification problem effectively. Moreover, a renal cell carcinoma dataset is employed to reveal the biological significance of the proposed algorithm through a number of biological analyses. Conclusion: The results demonstrate that CAPS may play a crucial role in the occurrence and development of renal clear cell carcinoma.
Both microRNA (miRNA) and mRNA expression profiles are important methods for cancer type classification. A comparative study of their classification performance will be helpful in choosing the means of classification. Here we evaluated the classification performance of miRNA and mRNA profiles using a new data mining approach based on a novel SVM (Support Vector Machines) based recursive feature elimination (nRFE) algorithm. Computational experiments showed that information encoded in miRNAs is not sufficient to classify cancers; gut-derived samples cluster more accurately when using mRNA expression profiles compared with using miRNA profiles; and poorly differentiated tumors (PDT) could be classified by mRNA expression profiles at an accuracy of 100% versus 93.8% when using miRNA profiles. Furthermore, we showed that mRNA expression profiles have higher capacity in normal tissue classifications than miRNA. We concluded that classification performance using mRNA profiles is superior to that of miRNA profiles in multiple-class cancer classifications.
The classification of cancer is a major research topic in bioinformatics. The high dimensionality and small sample size associated with gene expression data, however, make the classification quite challenging. Although principal component analysis (PCA) is of particular interest for high-dimensional data, it may overemphasize some aspects and ignore other important information contained in the richly complex data, because it displays only the differences in the first two- or three-dimensional PC subspaces. Based on PCA, a principal component accumulation (PCAcc) method was proposed. It employs the information contained in multiple PC subspaces and improves the class separability of cancers. The effectiveness of the present method was evaluated on four commonly used gene expression datasets, and the results show that the method performs well for cancer classification.
Non-specific lipid transfer proteins (nsLTPs) are small, basic proteins that are characterized by an eight-cysteine motif. The biological functions of these proteins have been reported to involve plant reproduction and biotic or abiotic stress response. With the completion of the barley genome sequence, a genome-wide analysis of nsLTPs in barley (Hordeum vulgare L.) (HvLTPs) will be helpful for understanding the function of nsLTPs in plants. We performed a genome-wide analysis of the nsLTP gene family in barley and identified 70 nsLTP genes, which can be divided into five types (1, 2, C, D, and G). Each type of nsLTP shares similar exon and intron gene structures. Expression analysis showed that barley nsLTPs have diverse expression patterns, revealing their various roles. Our results shed light on the phylogenetic relationships and potential functions of barley nsLTPs and will be useful for future studies of barley development and molecular breeding.
Differences between healthy subjects and associated disease risks are of substantial interest in clinical medicine. Based on clinical presentations, Traditional Chinese Medicine (TCM) classifies healthy people into nine constitutions: Balanced, Qi, Yang or Yin deficiency, Phlegm-dampness, Damp-heat, Blood stasis, Qi stagnation, and Inherited special constitutions. In particular, Yang and Yin deficiency constitutions exhibit cold and heat aversion, respectively. However, the intrinsic molecular characteristics of unbalanced phenotypes remain unclear. To determine whether gene expression-based clustering can recapitulate TCM-based classification, peripheral blood mononuclear cells (PBMCs) were collected from Chinese Han individuals with Yang/Yin deficiency (n = 12 each) and Balanced (n = 8) constitutions, and global gene expression profiles were determined using the Affymetrix HG-U133A Plus 2.0 array. Notably, we found that gene expression-based classifications reflected distinct TCM-based subtypes. Consistent with the clinical observation that subjects with Yang deficiency tend toward obesity, series-clustering analysis detected several key lipid metabolic genes (diacylglycerol acyltransferase (DGAT2), acyl-CoA synthetase (ACSL1), and ATP-binding cassette subfamily A member 1 (ABCA1)) to be down- and up-regulated in Yin and Yang deficiency constitutions, respectively. Our findings suggest that Yin/Yang deficiency and Balanced constitutions are unique entities in their mRNA expression profiles. Moreover, the distinct physical and clinical characteristics of each unbalanced constitution can be explained, in part, by specific gene expression signatures.
Machine-learning algorithms have been widely used in breast cancer diagnosis to help pathologists and physicians in the decision-making process. However, the high dimensionality of genetic data makes the classification process a challenging task. In this paper, we propose a new optimized wrapper gene selection method that is based on a nature-inspired algorithm (simulated annealing (SA)), which will help select the most informative genes for breast cancer prediction. These optimal genes will then be used to train the classifier to improve its accuracy and efficiency. Three supervised machine-learning algorithms, namely, the support vector machine, the decision tree, and the random forest, were used to create the classifier models that will help to predict breast cancer. Two different experiments were conducted using three datasets: gene expression (GE), deoxyribonucleic acid (DNA) methylation, and a combination of the two. Six measures were used to evaluate the performance of the proposed algorithm: accuracy, precision, recall, specificity, area under the curve (AUC), and execution time. The effectiveness of the proposed classifiers was evaluated through comprehensive experiments. The results demonstrated that our approach outperformed the conventional classifiers, as expected, in terms of accuracy and execution time. High accuracy values of 99.77%, 99.45%, and 99.45% have been achieved by SA-SVM for the GE, DNA methylation, and combined datasets, respectively. The execution time of the proposed approach was significantly reduced in comparison to that of the traditional classifiers, and the best execution time was reached by SA-SVM, which was 0.02, 0.03, and 0.02 on the GE, DNA methylation, and combined datasets, respectively. In regard to precision and specificity, SA-RF obtained the best result of 100 on the GE dataset, while SA-SVM attained the best recall result of 100 on the GE dataset.
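A wrapper gene selection driven by simulated annealing can be sketched as follows. This is a generic illustration of the technique, not the method of the paper: the single-bit-flip neighbourhood, the geometric cooling schedule, the parameter defaults, the function name `anneal_select`, and the toy scoring function are all assumptions. In a real wrapper, `score` would train and validate a classifier (for example an SVM) on the genes marked `True` in the mask.

```python
import math
import random


def anneal_select(n_features, score, steps=500, t0=1.0, cooling=0.99, seed=0):
    """Search for a boolean feature mask that maximizes score(mask)."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    current = [rng.random() < 0.5 for _ in range(n_features)]
    current_score = score(current)
    best, best_score = current[:], current_score
    temp = t0
    for _ in range(steps):
        # Neighbour: flip the inclusion of one randomly chosen feature.
        candidate = current[:]
        j = rng.randrange(n_features)
        candidate[j] = not candidate[j]
        cand_score = score(candidate)
        delta = cand_score - current_score
        # Always accept improvements; accept worse moves with
        # probability exp(delta / temp), which shrinks as temp cools.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = current[:], current_score
        temp *= cooling  # geometric cooling schedule (assumed)
    return best, best_score


def toy_score(mask):
    """Hypothetical stand-in for classifier accuracy: features 0 and 2
    are informative, and every selected feature costs a small penalty."""
    return sum(1 for j in (0, 2) if mask[j]) - 0.1 * sum(mask)


best_mask, best_value = anneal_select(5, toy_score)
print(best_mask, best_value)
```

Keeping the best mask seen so far, rather than returning the final state, means an unlucky late move cannot discard a good subset found earlier in the run.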
Classification of gene expression data is a pivotal research area that plays a substantial role in the diagnosis and prediction of diseases. Generally, feature selection is one of the most extensively used techniques in data mining approaches, especially in classification. Gene expression data are usually composed of dozens of samples characterized by thousands of genes. This increases the dimensionality, coupled with the existence of irrelevant and redundant features. Accordingly, the selection of informative genes (features) becomes difficult, which badly affects the gene classification accuracy. In this paper, we consider feature selection for classifying gene expression microarray datasets. The goal is to detect the genes most likely to be cancer-related in a distributed manner, which helps in effectively classifying the samples. Initially, the large number of considered features is subdivided and distributed among several processors. Then, a new filter selection method based on a fuzzy inference system is applied to each subset of the dataset. Finally, all the resulting features are ranked, and a wrapper-based selection method is applied. Experimental results showed that our proposed feature selection technique performs better than other techniques since it produces lower time latency and improves classification performance.
Background: Early stage (FIGO stage Ⅰ-Ⅱ) endometrioid endometrial adenocarcinoma (EEA) is very common in clinical practice. However, patients with early stage EEA show various clinical behaviors due to biological heterogeneity. Hence, we aimed to discover distinct classes of tumors based on gene expression profiling, and to analyze whether the molecular classification correlated with the histopathological stages or other clinical parameters. Methods: Hierarchical clustering was performed for class discovery in 28 early stage EEA samples using a special cDNA microarray chip containing 492 genes designed for endometrial cancer. Correlations between clinicopathologic parameters and our classification were analyzed. The significance analysis of microarrays (SAM) was used to identify the signature genes according to the tumor grade and myometrial invasion. Results: Three tumor subtypes (subtypes Ⅰ, Ⅱ and Ⅲ) were identified by hierarchical clustering; each subtype had different clinicopathological factors, such as tumor grade, myometrial invasion status, and FIGO stage. Moreover, SAM analysis showed 34 up-regulated genes in high grade tumors, and 38 up-regulated genes and 1 down-regulated gene in deep myometrial invasive tumors. The overlapping genes between these two high-risk factors were markedly up-regulated in subtype Ⅰ, but down-regulated in subtype Ⅲ. Conclusion: We have identified novel molecular subtypes in early stage EEA. Differential gene signatures characterize each tumor subtype, which could be used for recognizing the tumor risk and providing a basis for further treatment stratification.
To determine cancer pathway activities in nine types of primary tumors and NCI60 cell lines, we applied an in silico approach by examining gene signatures reflective of consequent pathway activation using gene expression data. Supervised learning approaches predicted that the Ras pathway is active in ~70% of lung adenocarcinomas but inactive in most squamous cell carcinomas, pulmonary carcinoids, and small cell lung carcinomas. In contrast, the TGF-β, TNF-α, Src, Myc, E2F3, and β-catenin pathways are inactive in lung adenocarcinomas. We predicted an active Ras, Myc, Src, and/or E2F3 pathway in significant percentages of breast cancer, colorectal carcinoma, and gliomas. Our results also suggest that Ras may be the most prevalent oncogenic pathway. Additionally, many NCI60 cell lines exhibited a gene signature indicative of an active Ras, Myc, and/or Src, but not E2F3, β-catenin, TNF-α, or TGF-β pathway. To our knowledge, this is the first comprehensive survey of cancer pathway activities in nine major tumor types and the most widely used NCI60 cell lines. The "gene expression pathway signatures" we have defined could facilitate the understanding of molecular mechanisms in cancer development and provide guidance to the selection of appropriate cell lines for cancer research and pharmaceutical compound screening.
Funding (GWO-HHO gene selection study): The Deputyship for Research and Innovation, “Ministry of Education” in Saudi Arabia, funded this research (IFKSUOR3-014-3).
Funding (ODNN-MGEC study): The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work under grant number (RGP 2/42/43). This work was supported by Taif University Researchers Supporting Program (project number: TURSP-2020/200), Taif University, Saudi Arabia.
Funding (robust signal-to-noise ratio study): King Saud University funded this work through Researchers Supporting Project Number (RSP2022R426), King Saud University, Riyadh, Saudi Arabia.
Funding: Supported by the Langfang Science and Technology Plan Project (No. 2018013151) from Hebei Petro China Central Hospital.
Abstract: Background: Recently, researchers have been drawn to identifying the crucial genes related to cancer, which play an important role in cancer diagnosis and treatment. However, when performing cancer molecular subtype classification from cancer gene expression data, it is challenging to obtain those significant genes because of the high dimensionality and high noise of the data. Moreover, existing methods often suffer from issues such as premature convergence. Methods: To address these problems, we propose a new ant colony optimization (ACO) algorithm called DACO to classify cancer gene expression datasets and identify the essential genes of different diseases. In DACO, first, we set the initial pheromone concentration based on a weight ranking vector to accelerate convergence; then, a dynamic pheromone volatility factor is designed to prevent the algorithm from getting stuck in a local optimum; finally, the pheromone update rule of the Ant Colony System is employed to update the pheromone globally and locally. To demonstrate the classification performance of the proposed algorithm, it is compared with different existing approaches on eight high-dimensional cancer gene expression datasets. Results: The experimental results show that the proposed algorithm performs better than other effective methods in terms of classification accuracy and the size of the selected feature sets. It can be used to address the classification problem effectively. Moreover, a renal cell carcinoma dataset is employed to reveal the biological significance of the proposed algorithm through a number of biological analyses. Conclusion: The results demonstrate that CAPS may play a crucial role in the occurrence and development of renal clear cell carcinoma.
Funding: Supported by a grant from the National High-tech R&D Program (863 Program, No. 2006AA02Z331) to Liangbiao Chen.
Abstract: Both microRNA (miRNA) and mRNA expression profiles are important methods for cancer type classification. A comparative study of their classification performance will be helpful in choosing the means of classification. Here we evaluated the classification performance of miRNA and mRNA profiles using a new data mining approach based on a novel SVM (Support Vector Machines) based recursive feature elimination (nRFE) algorithm. Computational experiments showed that information encoded in miRNAs is not sufficient to classify cancers; gut-derived samples cluster more accurately when using mRNA expression profiles compared with using miRNA profiles; and poorly differentiated tumors (PDT) could be classified by mRNA expression profiles at an accuracy of 100% versus 93.8% when using miRNA profiles. Furthermore, we showed that mRNA expression profiles have higher capacity in normal tissue classifications than miRNA. We concluded that classification performance using mRNA profiles is superior to that of miRNA profiles in multiple-class cancer classifications.
Funding: Supported by the National Natural Science Foundation of China (20835002) and the International Science and Technology Cooperation Program of the Ministry of Science and Technology (MOST) of China (2008DFA32250).
Abstract: The classification of cancer is a major research topic in bioinformatics. The high dimensionality and small sample size of gene expression data, however, make the classification quite challenging. Although principal component analysis (PCA) is of particular interest for high-dimensional data, it may overemphasize some aspects and ignore other important information contained in the richly complex data, because it displays only the differences in the first two- or three-dimensional PC subspaces. Based on PCA, a principal component accumulation (PCAcc) method was proposed. It employs the information contained in multiple PC subspaces and improves the class separability of cancers. The effectiveness of the present method was evaluated on four commonly used gene expression datasets, and the results show that the method performs well for cancer classification.
Funding: Supported by funds from the National Key Research and Development Program of China (2016YFD0100903).
Abstract: Non-specific lipid transfer proteins (nsLTPs) are small, basic proteins that are characterized by an eight-cysteine motif. The biological functions of these proteins have been reported to involve plant reproduction and biotic or abiotic stress response. With the completion of the barley genome sequence, a genome-wide analysis of nsLTPs in barley (Hordeum vulgare L.) (HvLTPs) will be helpful for understanding the function of nsLTPs in plants. We performed a genome-wide analysis of the nsLTP gene family in barley and identified 70 nsLTP genes, which can be divided into five types (1, 2, C, D, and G). Each type of nsLTP shares similar exon and intron gene structures. Expression analysis showed that barley nsLTPs have diverse expression patterns, revealing their various roles. Our results shed light on the phylogenetic relationships and potential functions of barley nsLTPs and will be useful for future studies of barley development and molecular breeding.
Funding: Supported by the National Key Basic Research Program of China (973 Program, No. 2011CB505400).
Abstract: Differences between healthy subjects and associated disease risks are of substantial interest in clinical medicine. Based on clinical presentations, Traditional Chinese Medicine (TCM) classifies healthy people into nine constitutions: Balanced, Qi, Yang or Yin deficiency, Phlegm-dampness, Damp-heat, Blood stasis, Qi stagnation, and Inherited special constitutions. In particular, Yang and Yin deficiency constitutions exhibit cold and heat aversion, respectively. However, the intrinsic molecular characteristics of unbalanced phenotypes remain unclear. To determine whether gene expression-based clustering can recapitulate TCM-based classification, peripheral blood mononuclear cells (PBMCs) were collected from Chinese Han individuals with Yang/Yin deficiency (n = 12 each) and Balanced (n = 8) constitutions, and global gene expression profiles were determined using the Affymetrix HG-U133A Plus 2.0 array. Notably, we found that gene expression-based classifications reflected distinct TCM-based subtypes. Consistent with the clinical observation that subjects with Yang deficiency tend toward obesity, series-clustering analysis detected several key lipid metabolic genes (diacylglycerol acyltransferase (DGAT2), acyl-CoA synthetase (ACSL1), and ATP-binding cassette subfamily A member 1 (ABCA1)) to be down- and up-regulated in Yin and Yang deficiency constitutions, respectively. Our findings suggest that Yin/Yang deficiency and Balanced constitutions are unique entities in their mRNA expression profiles. Moreover, the distinct physical and clinical characteristics of each unbalanced constitution can be explained, in part, by specific gene expression signatures.
Funding: The authors would like to acknowledge the Researchers Supporting Project Number (RSP-2020/287), King Saud University, Riyadh, Saudi Arabia, for their support in this work.
Abstract: Machine-learning algorithms have been widely used in breast cancer diagnosis to help pathologists and physicians in the decision-making process. However, the high dimensionality of genetic data makes the classification process a challenging task. In this paper, we propose a new optimized wrapper gene selection method based on a nature-inspired algorithm, simulated annealing (SA), which helps select the most informative genes for breast cancer prediction. These optimal genes are then used to train the classifier to improve its accuracy and efficiency. Three supervised machine-learning algorithms, namely the support vector machine, the decision tree, and the random forest, were used to create the classifier models for predicting breast cancer. Two different experiments were conducted using three datasets: gene expression (GE), deoxyribonucleic acid (DNA) methylation, and a combination of the two. Six measures were used to evaluate the performance of the proposed algorithm: accuracy, precision, recall, specificity, area under the curve (AUC), and execution time. The effectiveness of the proposed classifiers was evaluated through comprehensive experiments. The results demonstrated that our approach outperformed the conventional classifiers in terms of accuracy and execution time. High accuracy values of 99.77%, 99.45%, and 99.45% were achieved by SA-SVM for the GE, DNA methylation, and combined datasets, respectively. The execution time of the proposed approach was significantly reduced in comparison to that of the traditional classifiers, and the best execution time was reached by SA-SVM: 0.02, 0.03, and 0.02 on the GE, DNA methylation, and combined datasets, respectively. In regard to precision and specificity, SA-RF obtained the best result of 100 on the GE dataset, while SA-SVM attained the best recall result of 100 on the GE dataset.
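A wrapper gene selection driven by simulated annealing can be sketched generically as below. This is not the authors' implementation: the one-gene-flip neighbourhood, the geometric cooling schedule, all parameter defaults, and the toy fitness function are assumptions made for illustration. In a real wrapper, `fitness` would train a classifier (for example an SVM) on the selected genes and return its validation accuracy.

```python
import math
import random


def simulated_annealing_select(n_genes, fitness, initial=None, t_start=1.0,
                               t_end=1e-3, cooling=0.95, steps_per_temp=20,
                               rng=None):
    """Search for a gene subset (boolean mask) that maximizes `fitness`."""
    rng = rng or random.Random()
    if initial is not None:
        current = list(initial)
    else:
        current = [rng.random() < 0.5 for _ in range(n_genes)]
    current_score = fitness(current)
    best, best_score = current[:], current_score

    temp = t_start
    while temp > t_end:
        for _ in range(steps_per_temp):
            # Neighbour: flip the inclusion of one randomly chosen gene.
            candidate = current[:]
            i = rng.randrange(n_genes)
            candidate[i] = not candidate[i]
            cand_score = fitness(candidate)
            delta = cand_score - current_score
            # Always accept improvements; accept worse moves with
            # probability exp(delta / temp), which shrinks as temp cools.
            if delta >= 0 or rng.random() < math.exp(delta / temp):
                current, current_score = candidate, cand_score
                if current_score > best_score:
                    best, best_score = current[:], current_score
        temp *= cooling  # geometric cooling (an assumed schedule)
    return best, best_score


# Toy stand-in for classifier accuracy: genes 0 and 3 are "informative",
# and every selected gene carries a small cost to favour compact subsets.
INFORMATIVE = {0, 3}


def toy_fitness(mask):
    hits = sum(1 for i in INFORMATIVE if mask[i])
    return hits - 0.1 * sum(mask)


start = [False] * 6
best_mask, best_score = simulated_annealing_select(
    6, toy_fitness, initial=start, rng=random.Random(0))
```

Accepting occasional worse subsets early on is what lets the search escape local optima; tracking `best` separately ensures the returned subset is never worse than the starting one.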
Abstract: Classification of gene expression data is a pivotal research area that plays a substantial role in the diagnosis and prediction of diseases. Feature selection is one of the most extensively used techniques in data mining, especially in classification. Gene expression data are usually composed of dozens of samples characterized by thousands of genes. This high dimensionality is coupled with the presence of irrelevant and redundant features. Accordingly, the selection of informative genes (features) becomes difficult, which degrades classification accuracy. In this paper, we consider feature selection for classifying gene expression microarray datasets. The goal is to detect the genes most likely to be cancer-related in a distributed manner, which helps in effectively classifying the samples. Initially, the large set of candidate features is subdivided and distributed among several processors. Then, a new filter selection method based on a fuzzy inference system is applied to each subset of the dataset. Finally, all the resulting features are ranked, and a wrapper-based selection method is applied. Experimental results showed that our proposed feature selection technique performs better than other techniques, since it produces lower time latency and improves classification performance.
Abstract: Background: Early-stage (FIGO stage Ⅰ-Ⅱ) endometrioid endometrial adenocarcinoma (EEA) is very common in clinical practice. However, patients with early-stage EEA show various clinical behaviors due to biological heterogeneity. Hence, we aimed to discover distinct classes of tumors based on gene expression profiling and to analyze whether the molecular classification correlated with the histopathological stages or other clinical parameters. Methods: Hierarchical clustering was performed for class discovery in 28 early-stage EEA samples using a special cDNA microarray chip containing 492 genes designed for endometrial cancer. Correlations between clinicopathologic parameters and our classification were analyzed. Significance analysis of microarrays (SAM) was used to identify the signature genes according to tumor grade and myometrial invasion. Results: Three tumor subtypes (subtypes Ⅰ, Ⅱ, and Ⅲ) were identified by hierarchical clustering; each subtype had different clinicopathological factors, such as tumor grade, myometrial invasion status, and FIGO stage. Moreover, SAM analysis showed 34 up-regulated genes in high-grade tumors, and 38 up-regulated genes and 1 down-regulated gene in deep myometrial invasive tumors. The genes shared by these two high-risk factors were markedly up-regulated in subtype Ⅰ but down-regulated in subtype Ⅲ. Conclusion: We have identified novel molecular subtypes in early-stage EEA. Differential gene signatures characterize each tumor subtype, which could be used for recognizing tumor risk and providing a basis for further treatment stratification.
Abstract: To determine cancer pathway activities in nine types of primary tumors and NCI60 cell lines, we applied an in silico approach by examining gene signatures reflective of consequent pathway activation using gene expression data. Supervised learning approaches predicted that the Ras pathway is active in ~70% of lung adenocarcinomas but inactive in most squamous cell carcinomas, pulmonary carcinoids, and small cell lung carcinomas. In contrast, the TGF-β, TNF-α, Src, Myc, E2F3, and β-catenin pathways are inactive in lung adenocarcinomas. We predicted an active Ras, Myc, Src, and/or E2F3 pathway in significant percentages of breast cancers, colorectal carcinomas, and gliomas. Our results also suggest that Ras may be the most prevalent oncogenic pathway. Additionally, many NCI60 cell lines exhibited a gene signature indicative of an active Ras, Myc, and/or Src, but not E2F3, β-catenin, TNF-α, or TGF-β pathway. To our knowledge, this is the first comprehensive survey of cancer pathway activities in nine major tumor types and the most widely used NCI60 cell lines. The "gene expression pathway signatures" we have defined could facilitate the understanding of molecular mechanisms in cancer development and provide guidance for the selection of appropriate cell lines for cancer research and pharmaceutical compound screening.