Funding: Supported by the US Department of Agriculture, Agriculture and Food Research Initiative National Institute of Food and Agriculture Competitive grant no. 2015-67015-22947.
Abstract: Background: A random multiple-regression model that simultaneously fits all allele substitution effects for additive markers or haplotypes as uncorrelated random effects was proposed for Best Linear Unbiased Prediction using whole-genome data. Leave-one-out cross validation can be used to quantify the predictive ability of a statistical model. Methods: Naive application of leave-one-out cross validation is computationally intensive because the training and validation analyses need to be repeated n times, once for each observation. Efficient leave-one-out cross validation strategies are presented here, requiring little more effort than a single analysis. Results: The efficient leave-one-out cross validation strategies are 786 times faster than the naive application for a simulated dataset with 1,000 observations and 10,000 markers, and 99 times faster with 1,000 observations and 100 markers. These efficiencies relative to the naive approach using the same model will increase with increases in the number of observations. Conclusions: Efficient leave-one-out cross validation strategies are presented here, requiring little more effort than a single analysis.
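The abstract above does not spell out its efficient strategies. As an illustration only, the following sketch uses a standard shortcut for ridge-type regression with a fixed shrinkage parameter: the leave-one-out residual equals the full-data residual divided by (1 - h_ii), where h_ii is the leverage of observation i. The function names, the toy genotype codes (0, 1, 2) and the value of `lam` are assumptions made for this example, not details taken from the paper.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ridge_lhs(X, lam):
    """Return X'X + lam * I."""
    p = len(X[0])
    return [[sum(r[j] * r[k] for r in X) + (lam if j == k else 0.0)
             for k in range(p)] for j in range(p)]


def ridge_rhs(X, y):
    """Return X'y."""
    return [sum(r[j] * yi for r, yi in zip(X, y)) for j in range(len(X[0]))]


def loo_naive(X, y, lam):
    """Refit the model n times, each time leaving one observation out."""
    errors = []
    for i in range(len(y)):
        Xt, yt = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        beta = solve(ridge_lhs(Xt, lam), ridge_rhs(Xt, yt))
        errors.append(y[i] - dot(X[i], beta))
    return errors


def loo_fast(X, y, lam):
    """Fit once, then rescale each residual by 1 / (1 - leverage)."""
    A = ridge_lhs(X, lam)
    beta = solve(A, ridge_rhs(X, y))
    errors = []
    for xi, yi in zip(X, y):
        # Leverage h_ii = x_i' A^{-1} x_i; a real implementation would
        # factor A once instead of calling solve() for every observation.
        h = dot(xi, solve(A, xi))
        errors.append((yi - dot(xi, beta)) / (1.0 - h))
    return errors


# Toy data: 12 observations, 3 markers coded 0/1/2 (hypothetical).
random.seed(1)
X = [[float(random.choice((0, 1, 2))) for _ in range(3)] for _ in range(12)]
y = [random.gauss(0.0, 1.0) for _ in range(12)]
naive = loo_naive(X, y, lam=1.0)
fast = loo_fast(X, y, lam=1.0)
print(max(abs(a - b) for a, b in zip(naive, fast)))
```

Both functions return the same prediction errors up to rounding; only the naive version refits the model for every held-out observation.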
Funding: National Natural Science Foundation of China (Nos. 61071131 and 61271388); Beijing Natural Science Foundation (No. 4122040); Research Project of Tsinghua University (No. 2012Z01011); Doctoral Fund of the Ministry of Education of China (No. 20120002110036).
Abstract: Feature selection is a key task in statistical pattern recognition. Most feature selection algorithms have been proposed based on specific objective functions, which are usually intuitively reasonable but can sometimes be far from the more basic objectives of feature selection. This paper describes how to select features such that the basic objectives, e.g., classification or clustering accuracies, can be optimized in a more direct way. The analysis requires that the contribution of each feature to the evaluation metrics can be quantitatively described by some score function. Motivated by the conditional independence structure in probabilistic distributions, the analysis uses a leave-one-out feature selection algorithm which provides an approximate solution. The leave-one-out algorithm improves on the conventional greedy backward elimination algorithm by preserving more interactions among features in the selection process, so that the various feature selection objectives can be optimized in a unified way. Experiments on six real-world datasets with different feature evaluation metrics have shown that this algorithm outperforms popular feature selection algorithms in most situations.
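For orientation, the conventional greedy backward elimination that the abstract uses as its baseline can be sketched as follows. This is the baseline only, not the paper's leave-one-out algorithm, which the abstract does not describe in enough detail to reproduce. The score function and feature weights are hypothetical.

```python
def backward_elimination(features, score, keep):
    """Greedy baseline: repeatedly drop the feature whose removal
    leaves the highest-scoring remaining subset."""
    selected = list(features)
    while len(selected) > keep:
        # Every subset obtained by leaving exactly one feature out.
        candidates = [[f for f in selected if f != drop] for drop in selected]
        selected = max(candidates, key=score)
    return selected


# Hypothetical additive score: each feature contributes a fixed weight.
weights = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.5}


def toy_score(subset):
    return sum(weights[f] for f in subset)


kept = backward_elimination(["a", "b", "c", "d"], toy_score, keep=2)
print(kept)
```

With a purely additive score the greedy choice is optimal. Scores that depend on feature interactions are where a greedy pass can go wrong, which is the situation the abstract's algorithm targets.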
Funding: Supported by the National Natural Science Foundation of China (61074127).
Abstract: Because the solutions of the least squares support vector regression machine (LS-SVRM) are not sparse, prediction is slow and its applications are limited. The defects of the existing adaptive pruning algorithm for LS-SVRM are that the training speed is slow and the generalization performance is not satisfactory, especially for large-scale problems. Hence an improved algorithm is proposed. To accelerate training, the pruned data point and the fast leave-one-out error are employed to validate the temporary model obtained after decremental learning. A novel objective function in the termination condition, which involves all the constraints generated by all training data points, and three pruning strategies are employed to improve the generalization performance. The effectiveness of the proposed algorithm is tested on six benchmark datasets. The sparse LS-SVRM model has a faster training speed and better generalization performance.
Funding: The National Natural Science Foundation of China under contract Nos 41376190, 41271404, 41531179, 41421001 and 41601425; the Open Funds of the Key Laboratory of Integrated Monitoring and Applied Technologies for Marine Harmful Algal Blooms, SOA under contract No. MATHA201120204; the Scientific Research Project of Shanghai Marine Bureau under contract No. Hu Hai Ke2016-05; the Ocean Public Welfare Scientific Research Project, State Oceanic Administration of the People's Republic of China under contract Nos 201305027 and 201505008.
Abstract: The water quality grades of phosphate (PO4-P) and dissolved inorganic nitrogen (DIN) are integrated by spatial partitioning to fit the global and local semi-variograms of these nutrients. Leave-one-out cross validation is used to determine the statistical inference method. To minimize absolute average errors and error mean squares, stratified Kriging (SK) interpolation is applied to DIN and ordinary Kriging (OK) interpolation is applied to PO4-P. Ten percent of the sites are adjusted by considering their impact on the change in deviations in DIN and PO4-P interpolation and the resultant effect on areas with different water quality grades. Thus, seven redundant historical sites are removed. Seven historical sites are distributed in areas with water quality poorer than Grade IV at the north and south branches of the Changjiang (Yangtze River) Estuary and at the coastal region north of the Hangzhou Bay. Numerous sites are installed in these regions. The contents of various elements in the waters are not remarkably changed, and the waters are mixed well. Seven sites that have been optimized and removed are set to water with quality Grades III and IV. Optimization and adjustment of unrestricted areas show that the optimized and adjusted sites are mainly distributed in regions where the water quality grade undergoes transition. Therefore, key sites for adjustment and optimization are located at the boundaries of areas with different water quality grades and seawater.
Funding: Supported by SEDENA Budgetary Program, No. A022-2021.
Abstract: The process of selecting an artificial intelligence (AI) model to assist clinical diagnosis of a particular pathology, together with its validation tests, is relevant because the values of accuracy, sensitivity and specificity may not reflect the behavior of the method in a real environment. Here, we provide helpful considerations to increase the success of using an AI model in clinical practice.
Abstract: Objective: To establish a practical method for discriminating dementia groups from healthy elderly subjects by using scalp-recorded electroencephalograms (EEGs). Methods: 16-channel EEGs were recorded during resting state for 39 subjects in the dementia groups and 11 healthy elderly subjects. The connectivity between any two electrodes was estimated by synchronization likelihood (SL). The brain networks were constructed from normalized SL values. The present leave-one-out cross validation (LOOCV) required the Euclidean distance between any two subjects, each represented by a 120-dimensional vector of SL values, for six frequency bands. To investigate factors that would affect the LOOCV results, principal component analysis (PCA) was applied to all the subjects. Results: For the upper alpha band, the accuracy exceeded 80% and 70% in the dementia groups and the healthy elderly subjects, respectively. The LOOCV result could be explained in terms of brain networks such as the executive control network (ECN) and the default mode network (DMN), characterized by factor loadings of principal components. Conclusions: Dementia groups and healthy elderly subjects could be characterized by principal components of SL values between all the electrode pairs, even with fewer connections, which revealed disruption and preservation of DMN and ECN. Significance: This study will provide a simple and practical method for discriminating dementia groups from healthy elderly subjects by scalp-recorded EEGs.
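The abstract above states that its LOOCV relies on Euclidean distances between subjects but does not name the decision rule. The sketch below assumes a 1-nearest-neighbour rule purely for illustration, and uses made-up 2-dimensional vectors in place of the 120-dimensional SL vectors. It reports accuracy per group, as the abstract does.

```python
import math


def loocv_nearest_neighbor(vectors, labels):
    """Hold out each subject in turn and label it by its nearest remaining subject."""
    predictions = []
    for i, held_out in enumerate(vectors):
        others = (j for j in range(len(vectors)) if j != i)
        nearest = min(others, key=lambda j: math.dist(held_out, vectors[j]))
        predictions.append(labels[nearest])
    return predictions


def per_class_accuracy(labels, predictions):
    """Fraction of correctly classified subjects within each true class."""
    accuracy = {}
    for cls in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == cls]
        accuracy[cls] = sum(predictions[i] == cls for i in members) / len(members)
    return accuracy


# Hypothetical, well-separated toy vectors (not data from the study).
vectors = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
           (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = ["dementia"] * 3 + ["healthy"] * 3

predictions = loocv_nearest_neighbor(vectors, labels)
accuracy = per_class_accuracy(labels, predictions)
print(accuracy)
```

Reporting accuracy per class matters for unbalanced groups such as 39 versus 11 subjects, where a single pooled accuracy can hide poor performance on the smaller group.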
Funding: Supported by the Thirteenth Five-year Plan Pioneering project of High Technology Plan of the National Department of Technology (No. 2017YFC0503906), the Natural Science Foundation of Beijing (No. 5184036), the Project for Science and Technology Open Cooperation of Henan Province (172106000071), and the Chinese National Natural Science Foundations (Grant Nos. 31470641, 31300534 and 31570628). We also appreciate the valuable comments and constructive suggestions from two anonymous referees and the Associate Editor, who helped improve the manuscript. Z. Gao, Q. Wang and Z. Hu contributed equally to this work.
Abstract: Accurate estimation of tree biomass is essential for forest management. In recent years, several climate-sensitive allometric biomass models with diameter at breast height (D) as a predictor have been proposed for various tree species and climate zones to estimate tree aboveground biomass (AGB). However, these allometric models only account for the potential effects of climate on tree biomass and do not simultaneously explain the influence of climate on D growth. In this study, based on AGB data from 256 destructively sampled trees of three larch species randomly distributed across the five secondary climate zones in northeastern and northern China, we first developed a climate-sensitive AGB base model and a climate-sensitive D growth base model separately, using nonlinear least squares regression. A compatible simultaneous model system was then developed from the climate-sensitive AGB and D growth models using nonlinear seemingly unrelated regression. The potential effects of several temperature and precipitation variables on AGB and D growth were evaluated. The fitting results of the climate-sensitive base models were compared against those of their compatible simultaneous model system. It was found that decreased isothermality ([mean of monthly (maximum temperature - minimum temperature)] / (maximum temperature of the warmest month - minimum temperature of the coldest month)) and total growing season precipitation, and increased annual precipitation, significantly increased the values of AGB; an increase in temperature seasonality (a standard deviation of the mean monthly temperature) and precipitation seasonality (a standard deviation of the mean monthly precipitation) could lead to an increase in D. The differences in model fitting results between the compatible simultaneous system, which considers climate effects on both AGB and D growth, and its corresponding climate-sensitive AGB and D growth base models were very small and insignificant (p>0.05). Compared to the base models, the inherent correlation of AGB with D was taken into account effectively by the proposed compatible model system developed with the climate-sensitive AGB and D growth models. In addition, the compatible properties of the estimated AGB and D were also addressed substantially in the proposed model system.
Funding: Supported by the State Key Laboratory of Urban Water Resource and Environment, Harbin Institute of Technology (No. 2013DX10); the Sino-Dutch Research Program (No. zhmhgfs2011-001); the Sino-American Coal Chemical Industry Program (No. ZMAGZ 2011001).
Abstract: Prediction of the biodegradability of organic pollutants is an ecologically desirable and economically feasible tool for estimating the environmental fate of chemicals. In this paper, stepwise multiple linear regression analysis was applied to establish a quantitative structure-biodegradability relationship (QSBR) between the chemical structure and a novel biodegradation activity index (qmax) of 20 polycyclic aromatic hydrocarbons (PAHs). The frequency calculations at the B3LYP/6-311+G(2df,p) level showed no imaginary values, implying that all the structures are minima on the potential energy surface. After eliminating the parameters that had low correlation coefficients with qmax, the major descriptors influencing the biodegradation activity were screened to be Freq, D, MR, EHOMO and To IE. The evaluation of the developed QSBR model, using a leave-one-out cross-validation procedure, showed that the relationships are significant and that the model has good robustness and predictive ability. The results would be helpful for understanding the mechanisms governing biodegradation at the molecular level.
Funding: Supported by the National Key Research and Development Program of China (No. 2019YFC0605102) and the National Natural Science Foundation of China (Grant No. 41972307).
Abstract: Strike and dip are essential to the description of geological features and therefore play important roles in 3D geological modeling. Unevenly and sparsely measured orientations from geological field mapping pose problems for geological modeling, especially for covered and deep areas. This study developed a new method for estimating strike and dip based on structural expansion orientation, which can be automatically extracted from both geological and geophysical maps or profiles. Specifically, strike and dip can be estimated by minimizing an objective function composed of the included angle between the strike and dip and the leave-one-out cross-validation strike and dip. We used angle parameterization to reduce dimensionality and proposed a quasi-gradient descent (QGD) method to rapidly obtain a near-optimal solution, improving the time-efficiency and accuracy of objective function optimization with the particle swarm method. A synthetic basin fold model was subsequently used to test the proposed method, and the results showed that the strike and dip estimates were close to the true values. Finally, the proposed method was applied to a real fold structure largely covered by Cainozoic sediments in Australia. The strikes and dips estimated by the proposed method conformed to the actual geological structures more closely than those of the vector interpolation method did. As expected, the results of 3D geological implicit interface modeling and the strike and dip vector field were much improved by the addition of estimated strikes and dips.
Funding: Major Research Plan of National Natural Science Foundation of China (No. 91730301); Key Projects of National Natural Science Foundation of China (No. 11831015); the State Scholarship Fund of China (No. 201806790020).
Abstract: Background: Increasing evidence indicates that microRNAs (miRNAs) are functionally related to the development and progression of various human diseases. Inferring disease-related miRNAs can be helpful in promoting disease biomarker detection for the treatment, diagnosis, and prevention of complex diseases. Methods: To improve the prediction accuracy of miRNA-disease association and capture more potential disease-related miRNAs, we constructed a precise miRNA global similarity network (MSFSN) by calculating the miRNA similarity based on secondary structures, families, and functions. Results: We tested the network on the classical algorithms WBSMDA and RWRMDA through leave-one-out cross-validation, obtaining AUCs of 0.8212 and 0.9657, respectively. The proposed MSFSN was also applied to three cancers: breast neoplasms, hepatocellular carcinoma, and prostate neoplasms. Consequently, 82%, 76%, and 82% of the top 50 potential miRNAs for these diseases, respectively, are validated by the miRNA-disease association databases miR2Disease and oncomiRDB. Conclusion: MSFSN provides a novel miRNA similarity network that combines a precise function network with a global structure network of miRNAs to predict the associations between miRNAs and diseases in various models.