Near infrared reflectance spectroscopy (NIRS), a non-destructive measurement technique, was combined with partial least squares regression discriminant analysis (PLS-DA) to discriminate transgenic (TCTP and mi166) rice from wild-type (Zhonghua 11) rice. Furthermore, rice lines transformed with a protein gene (OsTCTP) and a regulation gene (Osmi166) were also discriminated from each other by the NIRS method. The performance of PLS-DA in the spectral ranges of 4 000-8 000 cm⁻¹ and 4 000-10 000 cm⁻¹ was compared to obtain the optimal spectral range. As a result, the transgenic and wild-type rice were distinguished from each other in the range of 4 000-10 000 cm⁻¹, and the correct classification rate was 100.0% in the validation test. The transgenic rice lines TCTP and mi166 were also distinguished from each other in the range of 4 000-10 000 cm⁻¹, and the correct classification rate was also 100.0%. In conclusion, NIRS combined with PLS-DA can be used for the discrimination of transgenic rice.
Computer-assisted partial least squares is introduced to simultaneously determine the contents of Deoxyschizandrin, Schisandrin and r-Schisandrin in an extract solution of wuweizi. Regression analysis of the experimental results shows that the average recovery of each component lies in the range from 98.9% to 110.3%, which means that partial least squares regression spectrophotometry can circumvent the overlapping of the absorption spectra of multiple components, so that satisfactory results can be obtained without any prior separation step.
With recent advances in biotechnology, the genome-wide association study (GWAS) has been widely used to identify genetic variants that underlie human complex diseases and traits. In case-control GWAS, the typical statistical strategy is traditional logistic regression (LR) based on single-locus analysis. However, such a single-locus analysis leads to the well-known multiplicity problem, with a risk of inflating type I error and reducing power. Dimension reduction-based techniques, such as principal component-based logistic regression (PC-LR) and partial least squares-based logistic regression (PLS-LR), have recently gained much attention in the analysis of high-dimensional genomic data. However, the performance of these methods is still not clear, especially in GWAS. We conducted simulations and a real data application to compare the type I error and power of PC-LR, PLS-LR and LR as applied to GWAS within a defined single nucleotide polymorphism (SNP) set region. We found that PC-LR and PLS-LR can reasonably control type I error under the null hypothesis. In contrast, LR, which is corrected by the Bonferroni method, was more conservative in all simulation settings. In particular, we found that PC-LR and PLS-LR had comparable power and that both outperformed LR, especially when the causal SNP was in high linkage disequilibrium with genotyped ones and had a small effect size in simulation. Based on SNP set analysis, we applied all three methods to analyze non-small cell lung cancer GWAS data.
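The conservativeness of Bonferroni-corrected single-locus testing noted in this abstract can be illustrated with a minimal sketch. The function names and the p-values below are hypothetical and are not taken from the study; the only assumption is the standard Bonferroni rule of testing each of m SNPs at level alpha/m.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per test: True where the null hypothesis is rejected."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# Hypothetical single-locus p-values for four SNPs in one SNP set.
p_values = [0.001, 0.02, 0.04, 0.3]
decisions = bonferroni_reject(p_values)
```

With four tests the per-SNP level shrinks from 0.05 to 0.0125, so SNPs with p = 0.02 or p = 0.04 are no longer declared significant. This is the loss of power that set-based approaches such as PC-LR and PLS-LR try to avoid by testing a few components instead of every SNP.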
When calculating rice evapotranspiration from weather factors, some independent variables are often found to be multicollinear. This can distort the traditional multivariate regression model based on the least squares method, and the stability of the model is lost. In this paper, the model is built with partial least squares regression: drawing on the ideas of principal component analysis and canonical correlation analysis, components are extracted from the original data. A model of rice evapotranspiration is thus built that resolves the multicollinearity among the independent variables (weather factors). Finally, the model is analyzed in several respects, and satisfactory results are obtained.
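How partial least squares sidesteps multicollinearity can be sketched with a one-component PLS1 fit written in plain Python. This is an illustrative sketch, not the model from the paper: the function names and the toy data are hypothetical, and the second predictor is deliberately an exact multiple of the first, a case in which the ordinary least squares normal equations are singular.

```python
import math


def pls1_one_component(X, y):
    """Fit a one-component PLS1 model; return (coefficients, intercept)."""
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction of maximum covariance between X and y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0:
        raise ValueError("predictors are uncorrelated with the response")
    w = [v / norm for v in w]
    # Scores (the latent component), then regress y on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)
    coef = [q * wj for wj in w]
    intercept = y_mean - sum(coef[j] * x_means[j] for j in range(p))
    return coef, intercept


def predict(X, coef, intercept):
    """Apply the fitted linear model to each row of X."""
    return [intercept + sum(c * v for c, v in zip(coef, row)) for row in X]


# Hypothetical data: the second column is exactly twice the first.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
y = [4.0, 7.0, 10.0, 13.0]
coef, intercept = pls1_one_component(X, y)
fitted = predict(X, coef, intercept)
```

Because the regression is carried out on a single latent score rather than on the two collinear columns, the fit is well defined and reproduces the response on this toy data. A practical model would use several components chosen by cross-validation.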
Based on the continuum power regression (CPR) method, a novel derivation of kernel partial least squares regression (named CPR-KPLS) is proposed for approximating arbitrary nonlinear functions. A kernel function is used to map the input variables (input space) into a reproducing kernel Hilbert space (the so-called feature space), where a linear CPR-PLS is constructed based on the projection of explanatory variables onto latent variables (components). The linear CPR-PLS in the high-dimensional feature space corresponds to a nonlinear CPR-KPLS in the original input space. This method offers a novel extension of kernel partial least squares regression (KPLS), and some numerical simulation results are presented to illustrate the feasibility of the proposed method.
Laser-induced breakdown spectroscopy (LIBS) has become a widely used atomic spectroscopic technique for rapid coal analysis. However, the vast amount of spectral information in LIBS contains signal uncertainty, which can affect its quantification performance. In this work, we propose a hybrid variable selection method to improve the performance of LIBS quantification. Important variables are first identified using Pearson's correlation coefficient, mutual information, the least absolute shrinkage and selection operator (LASSO) and random forest, and are then filtered and combined with empirical variables related to fingerprint elements of coal ash content. Subsequently, these variables are fed into a partial least squares regression (PLSR). Additionally, in some models, certain variables unrelated to ash content are removed manually to study the impact of variable deselection on model performance. The proposed hybrid strategy was tested on three LIBS datasets for quantitative analysis of coal ash content and compared with the corresponding data-driven baseline method. It is significantly better than variable selection based only on empirical knowledge and in most cases outperforms the baseline method. On all three datasets, the hybrid strategy combining empirical knowledge and data-driven algorithms achieved the lowest root mean square error of prediction (RMSEP) values, 1.605, 3.478 and 1.647, respectively, which were significantly lower than those obtained from multiple linear regression using only 12 empirical variables (1.959, 3.718 and 2.181, respectively). The LASSO-PLSR model with empirical support and 20 selected variables exhibited significantly improved performance after variable deselection, with RMSEP values dropping from 1.635, 3.962 and 1.647 to 1.483, 3.086 and 1.567, respectively. Such results demonstrate that using empirical knowledge as a support for data-driven variable selection can be a viable approach to improve the accuracy and reliability of LIBS quantification.
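The idea of combining data-driven and empirical variables can be sketched as follows. This is a deliberately reduced illustration: it uses only a Pearson-correlation filter (the study also uses mutual information, LASSO and random forest), and the function names, the toy intensities and the "empirical" index are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def hybrid_select(X, y, top_k, empirical):
    """Union of the top_k variables by |r| and a list of empirical variables."""
    n_vars = len(X[0])
    scores = [abs(pearson_r([row[j] for row in X], y)) for j in range(n_vars)]
    ranked = sorted(range(n_vars), key=lambda j: scores[j], reverse=True)
    return sorted(set(ranked[:top_k]) | set(empirical))


# Hypothetical line intensities (rows: samples, columns: spectral variables).
X = [[1, 5, 1], [2, 3, 1], [3, 4, 2], [4, 1, 1], [5, 2, 1]]
ash = [2, 4, 6, 8, 10]
# Variable 2 stands in for a line kept on empirical grounds.
selected = hybrid_select(X, ash, top_k=1, empirical=[2])
```

Here variable 0 is kept because it correlates most strongly with the response, and variable 2 is kept despite a zero correlation because it was supplied as prior knowledge. The selected columns would then be passed to the regression step.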
Estimating wheat grain protein content by remote sensing is important for assessing wheat quality at maturity and for setting grain harvest and purchase policies. However, spatial variability of soil condition, temperature, and precipitation affects grain protein content, and these factors usually cannot be monitored accurately by remote sensing data from a single image. In this research, the relationships between wheat protein content at maturity and wheat agronomic parameters at different growth stages were analyzed, and multi-temporal Landsat TM images were used to estimate grain protein content by partial least squares regression. Experimental data were acquired in the suburbs of Beijing during a 2-yr experiment in the period from 2003 to 2004. The determination coefficient, average deviation of self-modeling, and deviation of cross-validation were employed to assess the estimation accuracy of wheat grain protein content. Their values were 0.88, 1.30%, 3.81% and 0.72, 5.22%, 12.36% for 2003 and 2004, respectively. The research laid an agronomic foundation for grain protein content (GPC) estimation by multi-temporal remote sensing. The results showed that it is feasible to estimate the GPC of wheat from multi-temporal remote sensing data over large areas.
Accurate assessment of canopy carotenoid content (CC_{x+c}C) in crops is central to monitoring physiological conditions in plants and vegetation stress, and consequently to supporting agronomic decisions. However, due to the overlap of the absorption peaks of carotenoid (C_{x+c}) and chlorophyll (C_{a+b}), accurate estimation of carotenoid using reflectance in the region where carotenoids absorb is challenging. The objective of the present study was to assess CC_{x+c}C in winter wheat (Triticum aestivum L.) with ground- and aircraft-based hyperspectral measurements in the visible and near-infrared spectrum. In-situ hyperspectral reflectance was measured and airborne hyperspectral data were acquired during major growth stages of winter wheat in five consecutive field experiments. At the canopy level, a remarkable linear relationship (R² = 0.95, p < 0.001) existed between C_{x+c} and C_{a+b}, and the correlation between CC_{x+c}C and wavelengths within the 400 to 1000 nm range indicated that CC_{x+c}C could be estimated using reflectance ranging from visible to near-infrared wavebands. Results of C_{x+c} assessment based on chlorophyll and carotenoid indices showed that the red edge chlorophyll index (CI red edge) performed with the highest accuracy (R² = 0.77, RMSE = 22.27 μg/cm², MAE = 4.97 μg/cm²). Applying partial least squares regression (PLSR) in CC_{x+c}C retrieval emphasized the significance of reflectance within the 700 to 750 nm range in CC_{x+c}C assessment. Based on the CI red edge index, the use of airborne hyperspectral imagery achieved satisfactory results in mapping the spatial distribution of CC_{x+c}C. This study demonstrates that it is feasible to accurately assess CC_{x+c}C in winter wheat with the red edge chlorophyll index, provided that C_{x+c} correlates well with C_{a+b} at the canopy scale. It is therefore a promising method for CC_{x+c}C retrieval at the regional scale from aerial hyperspectral imagery.
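The abstract does not spell out the index, so the sketch below assumes the commonly used definition of the red edge chlorophyll index, CI red edge = R_NIR / R_red edge - 1. The function name, the reflectance values and the implied band choice (a near-infrared band and a red-edge band) are assumptions for illustration and may differ from what the study used.

```python
def ci_red_edge(r_nir, r_red_edge):
    """Red edge chlorophyll index from two reflectance values (0 to 1 scale)."""
    if r_red_edge <= 0:
        raise ValueError("red-edge reflectance must be positive")
    return r_nir / r_red_edge - 1.0


# Hypothetical canopy reflectances for one pixel.
index_value = ci_red_edge(r_nir=0.45, r_red_edge=0.15)
```

Applied pixel by pixel to an airborne image, an index of this kind yields a map that can then be calibrated against measured carotenoid content, which is the mapping step the abstract describes.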
Background: Fiber maturity is a key cotton quality property, and its variability in a sample impacts fiber processing and dyeing performance. Currently, maturity is determined by using established protocols in laboratories under a controlled environment. There is an increasing need to measure fiber maturity using low-cost (in general less than $20000) and small portable systems. In this study, a laboratory feasibility study was performed to assess the ability of the shortwave infrared hyperspectral imaging (SWIR HSI) technique to determine conditioned fiber maturity, and, as a comparison, an expensive (in general greater than $60000) commercial bench-top near infrared (NIR) instrument was used. Results: Although SWIR HSI and NIR represent different measurement technologies, consistent spectral characteristics were observed between the two instruments when they were used to measure the maturity of the locule fiber samples in seed cotton and of the well-defined fiber samples, respectively. Partial least squares (PLS) models were established using different spectral preprocessing parameters to predict fiber maturity. High prediction precision was indicated by a lower root mean square error of prediction (RMSEP) (<0.046), a higher R_p² (>0.518), and a greater percentage (97.0%) of samples within the 95% agreement range in the entire NIR region (1000-2500 nm) without the moisture band at 1940 nm. Conclusion: SWIR HSI has good potential for assessing cotton fiber maturity in a laboratory environment.
For the optimization of production processes and product quality, knowledge of the factors influencing the process outcome is often essential. Thus, process analytical technology (PAT) that allows deeper insight into the process and results in a mathematical description of the process behavior as a simple function of the most important process factors can help to achieve higher production efficiency and quality. The present study aims at characterizing a well-known industrial process, the transesterification reaction of rapeseed oil with methanol to produce fatty acid methyl esters (FAME) for use as biodiesel, in a continuous micro reactor set-up. To this end, a design of experiment (DoE) approach is applied, in which the effects of two process factors, the molar ratio and the total flow rate of the reactants, are investigated. The optimized process target response is the FAME mass fraction in the purified nonpolar phase of the product as a measure of reaction yield. The quantification is performed using attenuated total reflection infrared spectroscopy in combination with partial least squares regression. The data retrieved while carrying out the DoE experimental plan were used for statistical analysis. A non-linear model indicating a synergistic interaction between the studied factors describes the reactor behavior with a high coefficient of determination (R²) of 0.9608. Thus, we applied a PAT approach to generate further insight into this established industrial process.
Soil texture is an indicator of soil physical structure, which underpins many ecological functions of soils such as thermal regime, plant growth, and soil quality. However, traditional methods for soil texture measurement are time-consuming and labor-intensive. This study attempts to explore an indirect method for rapidly estimating the texture of three subgroups of purple soils (i.e. calcareous, neutral, and acidic). A total of 190 topsoil (0 - 10 cm) samples were collected from sloping croplands in the Tongnan and Beibei Districts of Chongqing Municipality in China. Vis-NIR spectra were measured and processed, and stepwise multiple linear regression (SMLR), partial least squares regression (PLSR), and back propagation neural network (BPNN) models were constructed to estimate soil texture. The clay fractions ranged from 4.40% to 27.12% while the sand fractions ranged from 0.34% to 36.57%, so the soil samples encompass three textural classes (i.e. silt, silt loam, and silty clay loam). For the original spectrum, the texture of calcareous and neutral purple soils was not significantly correlated with spectral reflectance, and the linear models (SMLR and PLSR) exhibited low prediction accuracy. The correlation coefficients and the goodness of fit between soil texture and the transformed spectra of all soil groups were increased by continuum-removal (CR), first-order differential (R'), and second-order differential (R") transformations. Among them, R" performed best in terms of improving the correlation coefficients and the goodness of fit. For the calcareous purple soil, SMLR exceeds PLSR and BPNN, with higher coefficient of determination (R²) and ratio of performance to inter-quartile distance (RPIQ) values and a lower root mean square error of validation (RMSEV), but for the neutral and acidic purple soils, the PLSR model has better prediction accuracy. In summary, the linear methods (SMLR and PLSR) are more reliable in estimating the texture of the three purple soil groups when using Vis-NIR spectroscopy inversion.
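The first-order (R') and second-order (R") differential transformations mentioned in this abstract can be sketched with simple finite differences. The abstract does not say how the derivatives were computed, so the plain difference quotient used here is an assumption (smoothed derivatives are also common in practice), and the function names and the four-band spectrum are hypothetical.

```python
def first_derivative(wavelengths, reflectance):
    """Finite-difference derivative, reported at the band midpoints."""
    midpoints, derivative = [], []
    for i in range(len(wavelengths) - 1):
        step = wavelengths[i + 1] - wavelengths[i]
        midpoints.append((wavelengths[i] + wavelengths[i + 1]) / 2)
        derivative.append((reflectance[i + 1] - reflectance[i]) / step)
    return midpoints, derivative


def second_derivative(wavelengths, reflectance):
    """Second derivative obtained by differentiating the first derivative."""
    midpoints, d1 = first_derivative(wavelengths, reflectance)
    return first_derivative(midpoints, d1)


# Hypothetical reflectance spectrum sampled every 10 nm.
wavelengths = [400.0, 410.0, 420.0, 430.0]
reflectance = [0.10, 0.12, 0.16, 0.22]
d1_bands, d1 = first_derivative(wavelengths, reflectance)
d2_bands, d2 = second_derivative(wavelengths, reflectance)
```

Differentiation removes constant and, at second order, linearly sloping baseline offsets, which is one reason transformed spectra can correlate better with soil properties than raw reflectance. It also amplifies noise, so the spectra are usually smoothed first.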
The water distribution system of one residential district in Tianjin is taken as an example to analyze changes in water quality. A partial least squares (PLS) regression model, in which turbidity and Fe are regarded as the control objectives, is used to establish the statistical model. The experimental results indicate that the PLS regression model predicts water quality well compared with the monitored data. The percentages of samples with absolute relative error below 15%, 20% and 30% are 44.4%, 66.7% and 100% (turbidity) and 33.3%, 44.4% and 77.8% (Fe) at the 4th sampling point, and 77.8%, 88.9% and 88.9% (turbidity) and 44.4%, 55.6% and 66.7% (Fe) at the 5th sampling point.
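The accuracy summary used in this abstract, the share of samples whose absolute relative error falls below a threshold, can be computed as in the sketch below. The function name and the nine monitored/predicted pairs are hypothetical and are not the paper's data; they were chosen only so that the counts come out as 4, 6 and 9 out of 9.

```python
def percent_within(monitored, predicted, threshold):
    """Percentage of samples with absolute relative error below threshold."""
    errors = [abs(p - m) / abs(m) for m, p in zip(monitored, predicted)]
    hits = sum(1 for e in errors if e < threshold)
    return round(100.0 * hits / len(errors), 1)


# Hypothetical monitored and predicted values for nine samples.
monitored = [1.0] * 9
predicted = [1.05, 1.10, 0.88, 1.14, 1.16, 0.82, 1.22, 1.25, 1.28]
summary = [percent_within(monitored, predicted, t) for t in (0.15, 0.20, 0.30)]
```

With nine samples every possible percentage is a multiple of 1/9, which is the pattern seen in figures such as 44.4% and 66.7%, although the abstract itself does not state the number of samples.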
With the development of mid-infrared (MIR) photoelectric devices, mid-infrared spectroscopy has become one of the important methods for non-invasive detection of blood glucose. The mid-infrared region (4000 - 400 cm⁻¹) contains the well-known fingerprint region of glucose (1200 - 800 cm⁻¹), which has clearer characteristic absorption peaks and better specificity. The MIR carries a great deal of molecular information about glucose. Non-invasive detection of blood glucose by mid-infrared spectroscopy needs to achieve a certain accuracy, and the quantitative model is an important factor affecting the accuracy of glucose detection. In this paper, simulated solutions containing only glucose and simulated mixed solutions are taken as the research objects, and the mid-infrared spectral data of the samples are collected. A full-spectrum partial least squares regression (PLSR) model, an SNV + Ctr-PLSR model, an MSC + Ctr-PLSR model, and a convolutional neural network (CNN) model were constructed for the 3000 - 900 cm⁻¹ band. A full-spectrum PLS model and a CNN model were constructed for the 1200 - 900 cm⁻¹ band. The experimental results show that the optimal model for both bands is the CNN: for the 3000 - 900 cm⁻¹ band, the correlation coefficient of the prediction set (Rp) is 0.95 and the root mean square error of the prediction set (RMSEP) is 22.10; for the 1200 - 900 cm⁻¹ band, Rp is 0.95 and RMSEP is 22.54. The research results show that the CNN is a promising method, which has higher accuracy than PLSR and is especially suitable for modeling the complex environment of the human body. In addition, the study provides a theoretical and practical basis for CNNs in feature selection and model interpretation.
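The SNV step named in the SNV + Ctr-PLSR model is the standard normal variate transformation, which centers and scales each spectrum by its own mean and standard deviation. The sketch below assumes the sample standard deviation (n - 1 in the denominator); the function name and the four-point spectrum are hypothetical.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: scale one spectrum to zero mean, unit spread."""
    mu = statistics.mean(spectrum)
    sd = statistics.stdev(spectrum)  # sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("a constant spectrum cannot be SNV-scaled")
    return [(value - mu) / sd for value in spectrum]


# Hypothetical absorbance values for one spectrum.
raw = [0.2, 0.4, 0.6, 0.8]
scaled = snv(raw)
```

Because the correction is applied spectrum by spectrum, it needs no reference spectrum, unlike multiplicative scatter correction (MSC), and it is typically applied before the regression model is fitted.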
In several LUCC studies, statistical methods are used to analyze land use data. A problem with using conventional statistical methods in land use analysis is that these methods assume the data to be statistically independent. In fact, the data tend to be dependent, a phenomenon known as multicollinearity, especially when there are few observations. In this paper, a partial least squares (PLS) regression approach is developed to study the relationships between land use and its influencing factors through a case study of the Suzhou-Wuxi-Changzhou region in China. Multicollinearity exists in the dataset, and the number of variables is high compared to the number of observations. Four PLS factors are selected through a preliminary analysis. The correlation analyses between land use and the influencing factors demonstrate the land use character of rural industrialization and urbanization in the Suzhou-Wuxi-Changzhou region, and also illustrate that the first PLS factor is best able to describe land use patterns quantitatively and that most of the statistical relations derived from it accord with the facts. As the explanatory capacity of the PLS factors decreases, the reliability of the model outcome decreases correspondingly.
In this paper, we consider the partial linear regression model $y_i = x_i\beta^* + g(t_i) + \varepsilon_i$, $i = 1, 2, \ldots, n$, where $(x_i, t_i)$ are known fixed design points, $g(\cdot)$ is an unknown function, $\beta^*$ is an unknown parameter to be estimated, and the random errors $\varepsilon_i$ are $(\alpha, \beta)$-mixing random variables. The p-th (p > 1) mean consistency, strong consistency and complete consistency of the least squares estimators of $\beta^*$ and $g(\cdot)$ are investigated under some mild conditions. In addition, a numerical simulation is carried out to study the finite sample performance of the theoretical results. Finally, a real data analysis is provided to further verify the effect of the model.
Statistical downscaling (SD) analyzes the relationship between a local-scale response and global-scale predictors. The SD model can be used to forecast rainfall (local scale) using global-scale precipitation from global circulation model (GCM) output. The objectives of this research were to determine the time lag of the GCM data and to build an SD model using the PCR method with the time-lagged GCM precipitation data. The rainfall observations in Indramayu, taken from 1979 to 2007, showed patterns similar to the GCM data on the 1st to 64th grids after a time shift (time lag). The time lag was determined using the cross-correlation function. However, the GCM data of the 64 grids showed a multicollinearity problem. This problem was solved by principal component regression (PCR), but the PCR model resulted in heterogeneous errors. The PCR model was modified to overcome the errors by adding dummy variables to the model. The dummy variables were determined based on partial least squares regression (PLSR). The PCR model with dummy variables improved the rainfall prediction. The SD model with lag-GCM predictors was also better than the SD model without lag-GCM.
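Choosing a time lag with the cross-correlation function can be sketched as follows: the predictor series is shifted against the response, the correlation is computed at each shift, and the shift with the highest correlation is kept. The function names and both toy series are hypothetical; the response is simply the predictor delayed by two time steps.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def best_lag(predictor, response, max_lag):
    """Lag (in time steps) at which predictor[t] best matches response[t + lag]."""
    best, best_r = 0, -2.0
    for lag in range(max_lag + 1):
        x = predictor[:len(predictor) - lag]
        y = response[lag:]
        r = pearson_r(x, y)
        if r > best_r:
            best, best_r = lag, r
    return best, best_r


# Hypothetical series: the "rainfall" follows the "GCM" series by two steps.
gcm = [1, 3, 2, 5, 4, 6, 3, 7, 2, 8]
rain = [0, 0, 1, 3, 2, 5, 4, 6, 3, 7]
lag, r_at_lag = best_lag(gcm, rain, max_lag=4)
```

In a downscaling setting the same search would be run for each grid, and the lagged grid series would then be used as predictors.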
Partial least squares (PLS) regression is an important linear regression method that efficiently addresses the multiple correlation problem by combining principal component analysis and multiple regression. In this paper, we present a quantum partial least squares (QPLS) regression algorithm. To address the high time complexity of PLS regression, we design a quantum eigenvector search method to speed up the construction of principal components and regression parameters. Meanwhile, we give a density matrix product method to avoid multiple accesses to quantum random access memory (QRAM) while building residual matrices. The time and space complexities of QPLS regression are logarithmic in the independent variable dimension n, the dependent variable dimension w, and the number of variables m. This algorithm achieves exponential speed-ups over PLS regression in n, m, and w. In addition, QPLS regression inspires us to explore more potential quantum machine learning applications in future work.
On the basis of experimental observations on animals, applications to clinical data on patients and theoretical statistical reasoning, the author developed a computer-assisted general mathematical model of the 'probacent'-probability equation, Equation (1), and the death rate (mortality probability) equation, Equation (2), derivable from Equation (1), that may be applicable as a general approximation method to make useful predictions of probable outcomes in a variety of biomedical phenomena [1-4]. Equations (1) and (2) contain a constant, γ and c, respectively. In the previous studies, the author used the least maximum-difference principle to determine these constants so that they were expected to best fit the reported data, minimizing the deviation. In this study, the author uses the method of computer-assisted least sum of squares to determine the constants γ and c in constructing the 'probacent'-related formulas best fitting the NCHS-reported data on survival probabilities and death rates in the US total adult population for 2001. The results of this study reveal that the method of computer-assisted mathematical analysis with the least sum of squares seems to be simpler, more accurate, more convenient and preferable to the previously used least maximum-difference principle, and fits the NCHS-reported data on survival probabilities and death rates in the US total adult population better. The computer program of curved regression for the 'probacent'-probability and death rate equations may be helpful in research in biomedicine.
Near infrared (NIR) hyperspectral imaging measurement of sugar content in peach was introduced. NIR spectral images (650~1 000 nm, resolution: 2 nm) of peach samples were captured with a purpose-built hyperspectral imaging setup. A partial least squares (PLS) regression prediction model was developed to estimate the sugar content in peach, and a step-wise backward method was used to determine the optimal wavelength subsets. Experimental results show that the calibration model with the optimal wavelength subsets has a correlation coefficient of prediction of 0.97 and a standard error of prediction of 0.19. Its prediction accuracy is higher than that of the calibration model applied over the whole wavelength range, which shows that variable selection plays an important role in improving the prediction accuracy of the PLS regression model.
Funding (NIRS transgenic rice study): Supported by the projects under the Innovation Team of the Safety Standards and Testing Technology for Agricultural Products of Zhejiang Province, China (Grant No. 2010R50028) and the National Key Technologies R&D Program of China during the 11th Five-Year Plan Period (Grant No. 2006BAK02A18).
Funding (GWAS PC-LR/PLS-LR study): Funded by the National Natural Science Foundation of China (81202283, 81473070, 81373102 and 81202267), the Key Grant of the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (10KJA330034 and 11KJA330001), the Research Fund for the Doctoral Program of Higher Education of China (20113234110002), and the Priority Academic Program for the Development of Jiangsu Higher Education Institutions (Public Health and Preventive Medicine).
文摘With recent advances in biotechnology, genome-wide association study (GWAS) has been widely used to identify genetic variants that underlie human complex diseases and traits. In case-control GWAS, typical statistical strategy is traditional logistical regression (LR) based on single-locus analysis. However, such a single-locus analysis leads to the well-known multiplicity problem, with a risk of inflating type I error and reducing power. Dimension reduction-based techniques, such as principal component-based logistic regression (PC-LR), partial least squares-based logistic regression (PLS-LR), have recently gained much attention in the analysis of high dimensional genomic data. However, the perfor- mance of these methods is still not clear, especially in GWAS. We conducted simulations and real data application to compare the type I error and power of PC-LR, PLS-LR and LR applicable to GWAS within a defined single nucleotide polymorphism (SNP) set region. We found that PC-LR and PLS can reasonably control type I error under null hypothesis. On contrast, LR, which is corrected by Bonferroni method, was more conserved in all simulation settings. In particular, we found that PC-LR and PLS-LR had comparable power and they both outperformed LR, especially when the causal SNP was in high linkage disequilibrium with genotyped ones and with a small effective size in simulation. Based on SNP set analysis, we applied all three methods to analyze non-small cell lung cancer GWAS data.
文摘During the course of calculating the rice evapotranspiration using weather factors,we often find that some independent variables have multiple correlation.The phenomena can lead to the traditional multivariate regression model which based on least square method distortion.And the stability of the model will be lost.The model will be built based on partial least square regression in the paper,through applying the idea of main component analyze and typical correlation analyze,the writer picks up some component from original material.Thus,the writer builds up the model of rice evapotranspiration to solve the multiple correlation among the independent variables (some weather factors).At last,the writer analyses the model in some parts,and gains the satisfied result.
Abstract: Based on the continuum power regression (CPR) method, a novel derivation of kernel partial least squares regression (named CPR-KPLS) is proposed for approximating arbitrary nonlinear functions. A kernel function is used to map the input variables (input space) into a reproducing kernel Hilbert space (the so-called feature space), where a linear CPR-PLS is constructed based on the projection of explanatory variables onto latent variables (components). The linear CPR-PLS in the high-dimensional feature space corresponds to a nonlinear CPR-KPLS in the original input space. This method offers a novel extension of kernel partial least squares regression (KPLS), and some numerical simulation results are presented to illustrate the feasibility of the proposed method.
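The kernel mapping described here is usually handled through a Gram matrix rather than explicit feature vectors. The sketch below builds such a matrix and centers it, which corresponds to mean-centering in the feature space. The Gaussian kernel, its width, and all names are assumptions for illustration; the abstract does not say which kernel the authors used.

```python
import math


def gaussian_kernel(x, z, width=1.0):
    """Gaussian (RBF) kernel between two input vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * width ** 2))


def gram_matrix(X, kernel=gaussian_kernel):
    """Matrix of kernel values K[i][j] = k(X[i], X[j])."""
    return [[kernel(xi, xj) for xj in X] for xi in X]


def center_gram(K):
    """Center the Gram matrix, equivalent to centering the mapped data in feature space."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]  # equals the column means, since K is symmetric
    total_mean = sum(row_mean) / n
    return [
        [K[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


# Hypothetical one-dimensional inputs.
X = [[0.0], [0.5], [1.5], [3.0]]
K = gram_matrix(X)
K_centered = center_gram(K)
```

A linear latent-variable method applied to the centered Gram matrix then acts as a nonlinear method in the original input space, which is the correspondence the abstract describes.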
Funding: Financial support from the National Natural Science Foundation of China (No. 62205172), the Huaneng Group Science and Technology Research Project (No. HNKJ22-H105), the Tsinghua University Initiative Scientific Research Program, and the International Joint Mission on Climate Change and Carbon Neutrality.
Abstract: Laser-induced breakdown spectroscopy (LIBS) has become a widely used atomic spectroscopic technique for rapid coal analysis. However, the vast amount of spectral information in LIBS contains signal uncertainty, which can affect its quantification performance. In this work, we propose a hybrid variable selection method to improve the performance of LIBS quantification. Important variables are first identified using Pearson's correlation coefficient, mutual information, the least absolute shrinkage and selection operator (LASSO) and random forest, and then filtered and combined with empirical variables related to fingerprint elements of coal ash content. Subsequently, these variables are fed into a partial least squares regression (PLSR). Additionally, in some models, certain variables unrelated to ash content are removed manually to study the impact of variable deselection on model performance. The proposed hybrid strategy was tested on three LIBS datasets for quantitative analysis of coal ash content and compared with the corresponding data-driven baseline method. It is significantly better than the variable selection method based only on empirical knowledge and in most cases outperforms the baseline method. The results showed that on all three datasets the hybrid strategy for variable selection combining empirical knowledge and data-driven algorithms achieved the lowest root mean square error of prediction (RMSEP) values of 1.605, 3.478 and 1.647, respectively, which were significantly lower than those obtained from multiple linear regression using only 12 empirical variables, which are 1.959, 3.718 and 2.181, respectively. The LASSO-PLSR model with empirical support and 20 selected variables exhibited a significantly improved performance after variable deselection, with RMSEP values dropping from 1.635, 3.962 and 1.647 to 1.483, 3.086 and 1.567, respectively. Such results demonstrate that using empirical knowledge as a support for data-driven variable selection can be a viable approach to improve the accuracy and reliability of LIBS quantification.
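One of the data-driven filters named in this abstract, ranking variables by Pearson's correlation with the target, can be sketched as follows, together with the RMSEP metric used to compare models. This is a generic illustration rather than the authors' code; the variable names (`line_A` and so on) and all numbers are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def top_k_variables(columns, target, k):
    """Names of the k variables with the largest absolute correlation to the target."""
    scores = {name: abs(pearson_r(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def rmsep(y_true, y_pred):
    """Root mean square error of prediction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


# Hypothetical spectral intensities for three candidate variables and four samples.
columns = {
    "line_A": [1.0, 2.0, 3.0, 4.0],
    "line_B": [2.0, 1.0, 4.0, 3.0],
    "line_C": [5.0, 5.0, 6.0, 4.0],
}
ash_content = [10.0, 20.0, 30.0, 40.0]
selected = top_k_variables(columns, ash_content, k=2)
print(selected)
```

In a hybrid scheme of the kind described, the selected names would then be merged with a list of empirically chosen variables before the regression step.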
Funding: The National Natural Science Foundation of China (41171281, 40701120) and the Beijing Nova Program, China (2008B33).
Abstract: Estimating wheat grain protein content by remote sensing is important for assessing wheat quality at maturity and for making grain harvest and purchase policies. However, spatial variability of soil condition, temperature, and precipitation affects grain protein content, and these factors usually cannot be monitored accurately by remote sensing data from a single image. In this research, the relationships between wheat protein content at maturity and wheat agronomic parameters at different growing stages were analyzed, and multi-temporal Landsat TM images were used to estimate grain protein content by partial least squares regression. Experimental data were acquired in the suburbs of Beijing during a 2-yr experiment from 2003 to 2004. The determination coefficient, average deviation of self-modeling, and deviation of cross-validation were employed to assess the estimation accuracy of wheat grain protein content. Their values were 0.88, 1.30%, 3.81% and 0.72, 5.22%, 12.36% for 2003 and 2004, respectively. The research laid an agronomic foundation for GPC (grain protein content) estimation by multi-temporal remote sensing. The results showed that it is feasible to estimate the GPC of wheat from multi-temporal remote sensing data over large areas.
Funding: Supported by the Fundamental Research Funds for the Provincial Universities of Zhejiang (Project No. GK229909299001-302), the National Natural Science Foundation of China (Project No. 41901268), the Natural Science Foundation of Zhejiang Province (Project No. LQ19D010009), and the Provincial Education Department General Scientific Research Items (Project No. Y202249845).
Abstract: Accurate assessment of canopy carotenoid content (CC_(x+c)C) in crops is central to monitoring physiological conditions in plants and vegetation stress, and consequently to supporting agronomic decisions. However, due to the overlap of the absorption peaks of carotenoid (C_(x+c)) and chlorophyll (C_(a+b)), accurate estimation of carotenoid using reflectance in the wavebands where carotenoids absorb is challenging. The objective of the present study was to assess CC_(x+c)C in winter wheat (Triticum aestivum L.) with ground- and aircraft-based hyperspectral measurements in the visible and near-infrared spectrum. In-situ hyperspectral reflectance was measured and airborne hyperspectral data were acquired during major growth stages of winter wheat in five consecutive field experiments. At the canopy level, a remarkable linear relationship (R^(2)=0.95, p<0.001) existed between C_(x+c) and C_(a+b), and the correlation between CC_(x+c)C and wavelengths within the 400 to 1000 nm range indicated that CC_(x+c)C could be estimated using reflectance ranging from visible to near-infrared wavebands. Results of C_(x+c) assessment based on chlorophyll and carotenoid indices showed that the red edge chlorophyll index (CI red edge) performed with the highest accuracy (R^(2)=0.77, RMSE=22.27 μg/cm^(2), MAE=4.97 μg/cm^(2)). Applying partial least squares regression (PLSR) in CC_(x+c)C retrieval emphasized the significance of reflectance within the 700 to 750 nm range in CC_(x+c)C assessment. Based on the CI red edge index, the use of airborne hyperspectral imagery achieved satisfactory results in mapping the spatial distribution of CC_(x+c)C. This study demonstrates that it is feasible to accurately assess CC_(x+c)C in winter wheat with the red edge chlorophyll index, provided that C_(x+c) correlates well with C_(a+b) at the canopy scale. It is therefore a promising method for CC_(x+c)C retrieval at regional scale from aerial hyperspectral imagery.
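The red edge chlorophyll index is commonly defined as the ratio of near-infrared to red-edge reflectance minus one. The sketch below uses that common definition. The abstract does not state which wavebands the authors used, so the reflectance values here are hypothetical placeholders.

```python
def ci_red_edge(reflectance_nir, reflectance_red_edge):
    """Red edge chlorophyll index: R_NIR / R_red_edge - 1 (common definition)."""
    if reflectance_red_edge <= 0:
        raise ValueError("red-edge reflectance must be positive")
    return reflectance_nir / reflectance_red_edge - 1.0


# Hypothetical canopy reflectances (unitless fractions) for a dense and a sparse canopy.
dense_canopy = ci_red_edge(0.45, 0.15)
sparse_canopy = ci_red_edge(0.30, 0.20)
print(dense_canopy, sparse_canopy)
```

Applied per pixel to two bands of a hyperspectral image, the same function would yield an index map of the kind used for spatial mapping.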
Funding: Supported partially by the USDA-ARS Research Project #6054-44000-080-00D.
Abstract: Background: Fiber maturity is a key cotton quality property, and its variability in a sample impacts fiber processing and dyeing performance. Currently, maturity is determined by using established protocols in laboratories under a controlled environment. There is an increasing need to measure fiber maturity using low-cost (in general less than $20000) and small portable systems. In this study, a laboratory feasibility study was performed to assess the ability of the shortwave infrared hyperspectral imaging (SWIR HSI) technique to determine the conditioned fiber maturity, and as a comparison, a bench-top commercial and expensive (in general greater than $60000) near infrared (NIR) instrument was used. Results: Although SWIR HSI and NIR represent different measurement technologies, consistent spectral characteristics were observed between the two instruments when they were used to measure the maturity of the locule fiber samples in seed cotton and of the well-defined fiber samples, respectively. Partial least squares (PLS) models were established using different spectral preprocessing parameters to predict fiber maturity. High prediction precision was indicated by a lower root mean square error of prediction (RMSEP) (<0.046), a higher R_(p)^(2) (>0.518), and a greater percentage (97.0%) of samples within the 95% agreement range in the entire NIR region (1000-2500 nm) without the moisture band at 1940 nm. Conclusion: SWIR HSI has good potential for assessing cotton fiber maturity in a laboratory environment.
Abstract: Optimizing production processes and product quality often requires knowledge of the factors influencing the process outcome. Process analytical technology (PAT) that allows deeper insight into the process and yields a mathematical description of the process behavior as a simple function of the most important process factors can therefore help to achieve higher production efficiency and quality. The present study aims at characterizing a well-known industrial process, the transesterification reaction of rapeseed oil with methanol to produce fatty acid methyl esters (FAME) for use as biodiesel, in a continuous micro reactor set-up. To this end, a design of experiment (DoE) approach is applied, in which the effects of two process factors, the molar ratio and the total flow rate of the reactants, are investigated. The optimized process target response is the FAME mass fraction in the purified nonpolar phase of the product as a measure of reaction yield. The quantification is performed using attenuated total reflection infrared spectroscopy in combination with partial least squares regression. The data retrieved while carrying out the DoE experimental plan were used for statistical analysis. A non-linear model indicating a synergistic interaction between the studied factors describes the reactor behavior with a high coefficient of determination (R^(2)) of 0.9608. Thus, we applied a PAT approach to generate further insight into this established industrial process.
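A two-factor model with an interaction term can be estimated directly from a full two-level factorial design, as sketched below. This assumes coded factor levels of -1 and +1 and a balanced design with four runs. The abstract does not describe the actual experimental plan, so the design, the function name, and the response values are hypothetical.

```python
def fit_two_factor_model(runs):
    """Estimate y = b0 + b1*x1 + b2*x2 + b12*x1*x2 from a balanced 2x2 factorial.

    Each run is (x1, x2, y) with x1 and x2 in coded units (-1 or +1).
    For such an orthogonal design, each coefficient is a simple contrast average.
    """
    n = len(runs)
    b0 = sum(y for _, _, y in runs) / n
    b1 = sum(x1 * y for x1, _, y in runs) / n
    b2 = sum(x2 * y for _, x2, y in runs) / n
    b12 = sum(x1 * x2 * y for x1, x2, y in runs) / n
    return b0, b1, b2, b12


# Hypothetical yields (%) for coded molar ratio x1 and coded total flow rate x2.
runs = [(-1, -1, 74.0), (1, -1, 80.0), (-1, 1, 76.0), (1, 1, 90.0)]
b0, b1, b2, b12 = fit_two_factor_model(runs)
print(b0, b1, b2, b12)
```

A positive `b12` means the effect of raising one factor is larger when the other factor is also high, which is one way a synergistic interaction shows up in such a model.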
Abstract: Soil texture is an indicator of soil physical structure, which underpins many ecological functions of soils such as thermal regime, plant growth, and soil quality. However, traditional methods for soil texture measurement are time-consuming and labor-intensive. This study explores an indirect method for rapidly estimating the texture of three subgroups of purple soils (i.e. calcareous, neutral, and acidic). A total of 190 topsoil (0 - 10 cm) samples were collected from sloping croplands in Tongnan and Beibei Districts of Chongqing Municipality in China. Vis-NIR spectra were measured and processed, and stepwise multiple linear regression (SMLR), partial least squares regression (PLSR), and back propagation neural network (BPNN) models were constructed to estimate soil texture. The clay fractions ranged from 4.40% to 27.12% while sand fractions ranged from 0.34% to 36.57%, so that the soil samples encompass three textural classes (i.e. silt, silt loam, and silty clay loam). For the original spectrum, the texture of calcareous and neutral purple soils was not significantly correlated with spectral reflectance, and the linear models (SMLR and PLSR) exhibited low prediction accuracy. The correlation coefficients and the goodness-of-fit between soil texture and the transformed spectra of all soil groups were increased by continuum-removal (CR), first-order differential (R'), and second-order differential (R") transformations. Among them, R" performed best in terms of improving the correlation coefficients and the goodness-of-fit. For the calcareous purple soil, SMLR exceeds PLSR and BPNN with higher coefficient of determination (R^(2)) and ratio of performance to inter-quartile distance (RPIQ) values and a lower root mean square error of validation (RMSEV), but for the neutral and acidic purple soils, the PLSR model has better prediction accuracy. In summary, the linear methods (SMLR and PLSR) are more reliable in estimating the texture of the three purple soil groups when using Vis-NIR spectroscopy inversion.
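The RPIQ metric named above is the inter-quartile range of the observed values divided by the root mean square error of the predictions. A minimal sketch follows; the quartile convention (`method="inclusive"`) is an assumption, since the abstract does not specify one, and the clay values are hypothetical.

```python
import math
import statistics


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def rpiq(observed, predicted):
    """Ratio of performance to inter-quartile distance: (Q3 - Q1) / RMSE."""
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return (q3 - q1) / rmse(observed, predicted)


# Hypothetical clay fractions (%) and model predictions for five validation samples.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.5, 2.5, 4.5, 4.5]
score = rpiq(observed, predicted)
print(score)
```

Because it scales the error by the spread of the observations, RPIQ allows models to be compared across soil groups whose texture ranges differ.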
Funding: Supported by the National Natural Science Foundation of China (No. 50478086) and the Tianjin Special Scientific Innovation Foundation (No. 06FZZDSH00900).
Abstract: The water distribution system of a residential district in Tianjin is taken as an example to analyze changes in water quality. A partial least squares (PLS) regression model, in which turbidity and Fe are regarded as control objectives, is used to establish the statistical model. The experimental results indicate that the PLS regression model predicts water quality well compared with the monitored data. The percentages of samples with absolute relative error below 15%, 20% and 30% are 44.4%, 66.7% and 100% (turbidity) and 33.3%, 44.4% and 77.8% (Fe) at the 4th sampling point, and 77.8%, 88.9% and 88.9% (turbidity) and 44.4%, 55.6% and 66.7% (Fe) at the 5th sampling point.
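The error summary reported here, the share of samples whose absolute relative error falls below a limit, can be computed as in the sketch below. The function name and the four sample values are hypothetical and are not the Tianjin monitoring data.

```python
def share_within(observed, predicted, limit):
    """Percentage of samples whose absolute relative error is below `limit`."""
    relative_errors = [abs(p - o) / abs(o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(e < limit for e in relative_errors) / len(relative_errors)


# Hypothetical monitored and predicted turbidity values (NTU).
observed = [1.0, 2.0, 4.0, 5.0]
predicted = [1.1, 2.5, 4.4, 6.75]

for limit in (0.15, 0.20, 0.30):
    print(f"below {limit:.0%}: {share_within(observed, predicted, limit):.1f}%")
```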
Abstract: With the development of mid-infrared (MIR) photoelectric devices, mid-infrared spectroscopy has become one of the important methods for non-invasive detection of blood glucose. The mid-infrared region (4000 - 400 cm^(-1)) contains the well-known fingerprint region of glucose (1200 - 800 cm^(-1)), which has clearer characteristic absorption peaks and better specificity. The MIR carries a great deal of molecular information about glucose. Non-invasive detection of blood glucose by mid-infrared spectroscopy needs to achieve a certain accuracy, and the quantitative model is an important factor affecting the accuracy of glucose detection. In this paper, samples of imitation solution containing only glucose and samples of imitation mixed solution are taken as the research objects, and the mid-infrared spectral data of the samples are collected. A full-spectrum partial least squares regression (PLSR) model, an SNV + Ctr-PLSR model, an MSC + Ctr-PLSR model, and a convolutional neural network (CNN) model were constructed for the 3000 - 900 cm^(-1) band. A full-spectrum PLS model and a CNN model were constructed for the 1200 - 900 cm^(-1) band. The experimental results show that the optimal model for both bands is the CNN: for the 3000 - 900 cm^(-1) band the correlation coefficient of the prediction set (Rp) is 0.95 and the root mean square error of the prediction set (RMSEP) is 22.10, and for the 1200 - 900 cm^(-1) band the Rp is 0.95 and the RMSEP is 22.54. The research results show that CNN is a promising method, which has higher accuracy than PLSR and is especially suitable for modeling the complex environment of the human body. In addition, the study provides a theoretical and practical basis for CNN in feature selection and model interpretation.
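Assuming that "SNV" in the model names above denotes standard normal variate preprocessing, the usual meaning in spectroscopy, the transform can be sketched as follows. Each spectrum is centered by its own mean and scaled by its own standard deviation. The absorbance values are hypothetical.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own mean and std."""
    mean = statistics.fmean(spectrum)
    std = statistics.stdev(spectrum)  # sample standard deviation (n - 1 denominator)
    if std == 0:
        raise ValueError("spectrum is constant; SNV is undefined")
    return [(value - mean) / std for value in spectrum]


# Hypothetical absorbances, and the same spectrum with a gain and offset distortion.
raw = [0.2, 0.4, 0.6, 0.8]
distorted = [2.0 * value + 0.1 for value in raw]

corrected_raw = snv(raw)
corrected_distorted = snv(distorted)
```

Both corrected spectra coincide up to rounding, which shows why the transform is used to remove per-sample offset and scaling effects before regression.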
Funding: National Natural Science Foundation of China, No. 40301038.
Abstract: In several LUCC studies, statistical methods are used to analyze land use data. A problem with using conventional statistical methods in land use analysis is that these methods assume the data to be statistically independent. In fact, the data tend to be dependent, a phenomenon known as multicollinearity, especially when there are few observations. In this paper, a Partial Least-Squares (PLS) regression approach is developed to study relationships between land use and its influencing factors through a case study of the Suzhou-Wuxi-Changzhou region in China. Multicollinearity exists in the dataset and the number of variables is high compared to the number of observations. Four PLS factors are selected through a preliminary analysis. The correlation analyses between land use and influencing factors demonstrate the land use character of rural industrialization and urbanization in the Suzhou-Wuxi-Changzhou region, and also show that the first PLS factor is best able to describe land use patterns quantitatively, with most of the statistical relations derived from it according with the facts. As the explanatory capacity of successive PLS factors decreases, the reliability of the model outcome decreases correspondingly.
Funding: Supported by the National Social Science Foundation of China (Grant No. 22BTJ059).
Abstract: In this paper, we consider the partial linear regression model $y_i = x_i\beta^* + g(t_i) + \varepsilon_i$, $i = 1, 2, \ldots, n$, where $(x_i, t_i)$ are known fixed design points, $g(\cdot)$ is an unknown function, $\beta^*$ is an unknown parameter to be estimated, and the random errors $\varepsilon_i$ are $(\alpha, \beta)$-mixing random variables. The $p$-th ($p > 1$) mean consistency, strong consistency and complete consistency for least squares estimators of $\beta^*$ and $g(\cdot)$ are investigated under some mild conditions. In addition, a numerical simulation is carried out to study the finite sample performance of the theoretical results. Finally, a real data analysis is provided to further verify the effect of the model.
Abstract: Statistical downscaling (SD) analyzes the relationship between a local-scale response and global-scale predictors. The SD model can be used to forecast rainfall (local scale) using global-scale precipitation from global circulation model (GCM) output. The objectives of this research were to determine the time lag of the GCM data and to build an SD model using the PCR method with the time lag of the GCM precipitation data. The rainfall observations in Indramayu, taken from 1979 to 2007, showed patterns similar to the GCM data on the 1st to 64th grids after a time shift (time lag). The time lag was determined using the cross-correlation function. However, the GCM data of the 64 grids showed a multicollinearity problem. This problem was solved by principal component regression (PCR), but the PCR model resulted in heterogeneous errors. The PCR model was modified to overcome the errors by adding dummy variables to the model. The dummy variables were determined based on partial least squares regression (PLSR). The PCR model with dummy variables improved the rainfall prediction. The SD model with lag-GCM predictors was also better than the SD model without lag-GCM.
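Choosing a time lag with the cross-correlation function amounts to shifting the predictor series and keeping the shift that gives the highest correlation with the response. The sketch below illustrates that idea with a single predictor series. The function names, the lag range, and the two toy series are hypothetical and unrelated to the Indramayu data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(predictor, response, lag):
    """Correlate response[t] with predictor[t - lag] for a non-negative lag."""
    return pearson_r(predictor[: len(predictor) - lag], response[lag:])


def best_lag(predictor, response, max_lag):
    """Lag in 0..max_lag with the highest cross-correlation."""
    return max(
        range(max_lag + 1),
        key=lambda lag: lagged_correlation(predictor, response, lag),
    )


# Toy series: the response repeats the predictor two time steps later.
gcm_precipitation = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.0, 1.0]
station_rainfall = [0.0, 0.0, 1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
lag = best_lag(gcm_precipitation, station_rainfall, max_lag=3)
print(lag)
```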
Funding: Project supported by the Fundamental Research Funds for the Central Universities, China (Grant No. 2019XD-A02), the National Natural Science Foundation of China (Grant Nos. U1636106, 61671087, 61170272, and 92046001), the Natural Science Foundation of Beijing Municipality, China (Grant No. 4182006), the Technological Special Project of Guizhou Province, China (Grant No. 20183001), and the Foundation of Guizhou Provincial Key Laboratory of Public Big Data (Grant Nos. 2018BDKFJJ016 and 2018BDKFJJ018).
Abstract: Partial least squares (PLS) regression is an important linear regression method that efficiently addresses the multiple correlation problem by combining principal component analysis and multiple regression. In this paper, we present a quantum partial least squares (QPLS) regression algorithm. To address the high time complexity of PLS regression, we design a quantum eigenvector search method to speed up the construction of principal components and regression parameters. We also give a density matrix product method to avoid multiple accesses to quantum random access memory (QRAM) while building residual matrices. The time and space complexities of the QPLS regression are logarithmic in the independent variable dimension n, the dependent variable dimension w, and the number of variables m. This algorithm achieves exponential speed-ups over PLS regression in n, m, and w. In addition, the QPLS regression inspires us to explore more potential quantum machine learning applications in future work.
Abstract: On the basis of experimental observations on animals, applications to clinical data on patients and theoretical statistical reasoning, the author developed a computer-assisted general mathematical model of the 'probacent'-probability equation, Equation (1), and the death rate (mortality probability) equation, Equation (2), derivable from Equation (1), which may be applicable as a general approximation method to make useful predictions of probable outcomes in a variety of biomedical phenomena [1-4]. Equations (1) and (2) contain a constant, γ and c, respectively. In the previous studies, the author used the least maximum-difference principle to determine these constants so that they were expected to best fit reported data, minimizing the deviation. In this study, the author uses the method of computer-assisted least sum of squares to determine the constants γ and c in constructing the 'probacent'-related formulas best fitting the NCHS-reported data on survival probabilities and death rates in the US total adult population for 2001. The results of this study reveal that the method of computer-assisted mathematical analysis with the least sum of squares seems to be simpler, more accurate, more convenient and preferable to the previously used least maximum-difference principle, and to fit the NCHS-reported data on survival probabilities and death rates in the US total adult population better. The computer program of curved regression for the 'probacent'-probability and death rate equations may be helpful in research in biomedicine.
Abstract: Near infrared (NIR) hyperspectral imaging measurement of sugar content in peach was introduced. NIR spectral images (650-1000 nm, resolution: 2 nm) of peach samples were captured with a purpose-built hyperspectral imaging setup. A partial least squares (PLS) regression prediction model was developed to estimate the sugar content in peach, and a step-wise backward method was used to determine the optimal wavelength subsets. Experimental results show that the calibration model with the optimal wavelength subsets has a correlation coefficient of prediction of 0.97 and a standard error of prediction of 0.19. Its prediction accuracy is higher than that of the calibration model applied over the whole wavelength range, which shows that variable selection plays an important role in improving the prediction accuracy of a PLS regression model.