Near infrared reflectance spectroscopy (NIRS), a non-destructive measurement technique, was combined with partial least squares regression discriminant analysis (PLS-DA) to discriminate transgenic (TCTP and mi166) rice from wild-type (Zhonghua 11) rice. Furthermore, rice lines transformed with a protein gene (OsTCTP) and a regulation gene (Osmi166) were also discriminated by the NIRS method. The performance of PLS-DA in the spectral ranges of 4 000-8 000 cm⁻¹ and 4 000-10 000 cm⁻¹ was compared to obtain the optimal spectral range. As a result, the transgenic and wild-type rice were distinguished from each other in the range of 4 000-10 000 cm⁻¹, and the correct classification rate was 100.0% in the validation test. The transgenic rice lines TCTP and mi166 were also distinguished from each other in the range of 4 000-10 000 cm⁻¹, again with a correct classification rate of 100.0%. In conclusion, NIRS combined with PLS-DA can be used for the discrimination of transgenic rice.
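The PLS-DA idea in the abstract above can be sketched in a few lines: regress a 0/1 class label on the spectra through a latent component, then threshold the prediction. The following is only a minimal illustration, not the authors' pipeline. It assumes a single latent component, 0/1 label coding, a 0.5 decision threshold, and invented two-variable "spectra"; practical PLS-DA models typically use several components chosen by cross-validation.

```python
# Minimal one-component PLS1 used as a two-class PLS-DA classifier.
# Assumptions (not from the source): labels coded 0/1, one latent component,
# a 0.5 decision threshold, and made-up toy data.

def fit_pls1_one_component(X, y):
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction of maximum covariance between X and y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores t = Xc w, then regress the centered label on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)
    # With one component the regression coefficients reduce to q * w.
    coef = [q * v for v in w]
    return x_mean, y_mean, coef

def predict_class(model, x, threshold=0.5):
    x_mean, y_mean, coef = model
    y_hat = y_mean + sum((x[j] - x_mean[j]) * coef[j] for j in range(len(x)))
    return 1 if y_hat > threshold else 0

# Hypothetical training set: class 0 = wild type, class 1 = transgenic.
X_train = [[1.0, 0.2], [1.1, 0.1], [0.9, 0.3],
           [2.0, 0.2], [2.1, 0.1], [1.9, 0.3]]
y_train = [0, 0, 0, 1, 1, 1]
model = fit_pls1_one_component(X_train, y_train)
```

Calling `predict_class(model, x)` on a held-out sample plays the role of the validation test; the correct classification rate is the fraction of held-out samples whose predicted class matches the known one.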
Many complex traits are highly correlated rather than independent. By taking the correlation structure of multiple traits into account, joint association analyses can achieve both higher statistical power and more accurate estimation. To develop a statistical approach to joint association analysis that includes allele detection and genetic effect estimation, we combined multivariate partial least squares regression with variable selection strategies and selected the optimal model using the Bayesian Information Criterion (BIC). We then performed extensive simulations under varying heritabilities and sample sizes to compare the performance of our method with that of single-trait multilocus methods. Joint association analysis has measurable advantages over single-trait methods, as it exhibits superior gene detection power, especially for pleiotropic genes. Sample size, heritability, polymorphic information content (PIC), and magnitude of gene effects influence the statistical power, accuracy and precision of effect estimation by the joint association analysis.
With recent advances in biotechnology, the genome-wide association study (GWAS) has been widely used to identify genetic variants that underlie human complex diseases and traits. In case-control GWAS, the typical statistical strategy is traditional logistic regression (LR) based on single-locus analysis. However, such a single-locus analysis leads to the well-known multiplicity problem, with a risk of inflating type I error and reducing power. Dimension reduction-based techniques, such as principal component-based logistic regression (PC-LR) and partial least squares-based logistic regression (PLS-LR), have recently gained much attention in the analysis of high dimensional genomic data. However, the performance of these methods is still not clear, especially in GWAS. We conducted simulations and a real data application to compare the type I error and power of PC-LR, PLS-LR and LR as applied to GWAS within a defined single nucleotide polymorphism (SNP) set region. We found that PC-LR and PLS-LR can reasonably control type I error under the null hypothesis. In contrast, LR corrected by the Bonferroni method was more conservative in all simulation settings. In particular, we found that PC-LR and PLS-LR had comparable power and both outperformed LR, especially when the causal SNP was in high linkage disequilibrium with genotyped ones and had a small effect size in simulation. Based on SNP set analysis, we applied all three methods to analyze non-small cell lung cancer GWAS data.
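The conservativeness of Bonferroni-corrected single-locus LR noted above follows from how the correction works: each of the m tests is held to a threshold of alpha/m, which treats the tests as independent even when SNPs in the set are correlated through linkage disequilibrium. A minimal sketch, assuming an illustrative alpha of 0.05 and invented p-values (neither is taken from the study):

```python
# Bonferroni correction for single-locus tests within one SNP set.
# Assumptions: the p-values are invented and alpha = 0.05 is illustrative.

def bonferroni_significant(p_values, alpha=0.05):
    """Return the per-test threshold and a significance flag for each test."""
    m = len(p_values)
    threshold = alpha / m
    return threshold, [p < threshold for p in p_values]

# Hypothetical p-values for five SNPs in one region.
p_values = [0.001, 0.02, 0.04, 0.3, 0.6]
threshold, flags = bonferroni_significant(p_values)
```

Here three SNPs fall below the nominal 0.05 level, but only the first survives the corrected threshold of 0.01.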
In several LUCC studies, statistical methods are being used to analyze land use data. A problem with using conventional statistical methods in land use analysis is that these methods assume the data to be statistically independent. In fact, the data tend to be dependent, a phenomenon known as multicollinearity, especially when there are few observations. In this paper, a Partial Least-Squares (PLS) regression approach is developed to study relationships between land use and its influencing factors through a case study of the Suzhou-Wuxi-Changzhou region in China. Multicollinearity exists in the dataset and the number of variables is high compared to the number of observations. Four PLS factors are selected through a preliminary analysis. The correlation analyses between land use and influencing factors demonstrate the land use character of rural industrialization and urbanization in the Suzhou-Wuxi-Changzhou region. They also illustrate that the first PLS factor is sufficient to best describe land use patterns quantitatively, and most of the statistical relations derived from it accord with the facts. As the descriptive capacity of the successive PLS factors decreases, the reliability of the model outcome decreases correspondingly.
Boosting algorithms are a class of general methods used to improve the generalization performance of regression analysis. The main idea is to maintain a distribution over the training set. In order to use the given distribution directly, a modified PLS algorithm is proposed and used as the base learner to deal with nonlinear multivariate regression problems. Experiments on gasoline octane number prediction demonstrate that boosting the modified PLS algorithm gives better generalization performance than the PLS algorithm.
Breast cancer is one of the malignant tumors with a high incidence in women. Its incidence has increased in all parts of the world since the twentieth century, but its etiology is not yet completely clear, so the examination of breast cells is very important. In this paper, we built a regression model for breast cells and, by training the model, obtained a method for predicting whether breast cells are benign or malignant. Using 10 features of breast cells, the prediction reached an accuracy of up to 93.67%. The method was very effective in predicting and analysing whether breast cells are becoming cancerous, and it can play an important role in the diagnosis and prevention of breast cancer.
Laser-induced breakdown spectroscopy (LIBS) has become a widely used atomic spectroscopic technique for rapid coal analysis. However, the vast amount of spectral information in LIBS contains signal uncertainty, which can affect its quantification performance. In this work, we propose a hybrid variable selection method to improve the performance of LIBS quantification. Important variables are first identified using Pearson's correlation coefficient, mutual information, the least absolute shrinkage and selection operator (LASSO) and random forest, and then filtered and combined with empirical variables related to fingerprint elements of coal ash content. Subsequently, these variables are fed into a partial least squares regression (PLSR). Additionally, in some models, certain variables unrelated to ash content are removed manually to study the impact of variable deselection on model performance. The proposed hybrid strategy was tested on three LIBS datasets for quantitative analysis of coal ash content and compared with the corresponding data-driven baseline method. It is significantly better than variable selection based on empirical knowledge alone and in most cases outperforms the baseline method. The results showed that, on all three datasets, the hybrid strategy combining empirical knowledge and data-driven algorithms achieved the lowest root mean square error of prediction (RMSEP) values of 1.605, 3.478 and 1.647, respectively, which were significantly lower than the values of 1.959, 3.718 and 2.181 obtained from multiple linear regression using only 12 empirical variables. The LASSO-PLSR model with empirical support and 20 selected variables exhibited significantly improved performance after variable deselection, with RMSEP values dropping from 1.635, 3.962 and 1.647 to 1.483, 3.086 and 1.567, respectively. Such results demonstrate that using empirical knowledge as a support for data-driven variable selection can be a viable approach to improve the accuracy and reliability of LIBS quantification.
Powdery mildew (Blumeria graminis) is one of the most destructive crop diseases infecting winter wheat plants, and has devastated millions of hectares of farmland in China. The objective of this study is to detect the disease damage of powdery mildew at leaf level by means of hyperspectral measurements, particularly using continuous wavelet analysis. In May 2010, the reflectance spectra and the biochemical properties were measured for 114 leaf samples with various degrees of disease severity. A hyperspectral imaging system was also employed to obtain detailed hyperspectral information on the normal and the pustule areas within one diseased leaf. Based on these spectral data, a continuous wavelet analysis (CWA) was carried out in conjunction with a correlation analysis, which generated a so-called correlation scalogram that summarizes the correlations between disease severity and the wavelet power at different wavelengths and decomposition scales. By using a thresholding approach, seven wavelet features were isolated for developing models to determine disease severity. In addition, 22 conventional spectral features (SFs) were also tested and compared with the wavelet features for their efficiency in estimating disease severity. Multivariate linear regression (MLR) analysis and partial least square regression (PLSR) analysis were adopted as training methods in model development. The spectral characteristics of powdery mildew at leaf level were found to be closely related to the spectral characteristics of the pustule area and the content of chlorophyll. The wavelet features performed better than the conventional SFs in capturing this spectral change. Moreover, the regression model composed of seven wavelet features outperformed (R2=0.77, relative root mean square error RRMSE=0.28) the model composed of 14 optimal conventional SFs (R2=0.69, RRMSE=0.32) in estimating the disease severity. The PLSR method yielded a higher accuracy than the MLR method. A combination of CWA and PLSR was found to be promising in providing relatively accurate estimates of the disease severity of powdery mildew at leaf level.
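The two accuracy figures quoted above, R2 and RRMSE, can be computed as in the following sketch. The abstract does not define either quantity, so two common conventions are assumed here: R2 as the coefficient of determination (1 minus residual over total sum of squares) and RRMSE as the root mean square error divided by the mean of the observed values. The observed and predicted values are invented.

```python
# R2 and relative RMSE for a regression model's predictions.
# Assumptions: R2 = 1 - SS_res / SS_tot and RRMSE = RMSE / mean(observed);
# the study may have used different definitions. The data are made up.

def r2_and_rrmse(observed, predicted):
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    rmse = (ss_res / n) ** 0.5
    return r2, rmse / mean_obs

# Hypothetical disease-severity values (percent) and model estimates.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
r2, rrmse = r2_and_rrmse(observed, predicted)
```

Because RRMSE is scaled by the mean of the observations, it allows models fitted to quantities with different units or ranges to be compared on a common footing.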
Multivariate statistical process monitoring and control (MSPM&C) methods for chemical process monitoring with statistical projection techniques such as principal component analysis (PCA) and partial least squares (PLS) are surveyed in this paper. The four-step procedure for performing MSPM&C on a chemical process is analyzed: modeling the process, detecting abnormal events or faults, identifying the variable(s) responsible for the faults, and diagnosing the source cause of the abnormal behavior. Several main research directions of MSPM&C reported in the literature are discussed, such as multi-way principal component analysis (MPCA) for batch processes, statistical monitoring and control for nonlinear processes, dynamic PCA and dynamic PLS, and on-line quality control by inferential models. Industrial applications of MSPM&C to several typical chemical processes, such as chemical reactors, distillation columns, polymerization processes and petroleum refinery units, are summarized. Finally, some concluding remarks and future considerations are made.
The performance of different chemometric approaches was evaluated in the spectrophotometric determination of pharmaceutical mixtures characterized by very high ratios between the amounts of their components. Principal component regression (PCR), partial least squares with one dependent variable (PLS1) or multiple dependent variables (PLS2), and multivariate curve resolution (MCR) were applied to the spectral data of a ternary mixture containing paracetamol, sodium ascorbate and chlorpheniramine (150:140:1, m/m/m), and a quaternary mixture containing paracetamol, caffeine, phenylephrine and chlorpheniramine (125:6.25:1.25:1, m/m/m/m). The UV spectra of the calibration samples in the range of 200-320 nm were pre-treated by removing noise and useless data, and the wavelength regions having the most useful analytical information were selected using the regression coefficients calculated in the multivariate modeling. All the defined chemometric models were validated on external sample sets and then applied to commercial pharmaceutical formulations. Different data intervals, fixed at 0.5, 1.0, and 2.0 point/nm, were tested to optimize the prediction ability of the models. The best results were obtained using the PLS1 calibration models, and the quantification of the species present in lower amounts was significantly improved by adopting the 0.5 data interval, which showed accuracy between 94.24% and 107.76%.
A simple and rapid analytical method for the simultaneous quantification of three commercial azo dyes, Tartrazine (TAR), Congo Red (CR), and Amido Black (AB), in water is presented. The simultaneous assessment of the individual concentration of an organic dye in mixtures using a spectrophotometric method is a difficult procedure in analytical chemistry, due to spectral overlapping. This drawback can be overcome if a multivariate calibration method such as Partial Least Squares Regression (PLSR) is used. This study presents a calibration model based on absorption spectra in the 300-650 nm range for a set of 20 different mixtures of dyes, followed by the prediction of the concentrations of dyes in 6 randomly selected validation mixtures using the PLSR method. Estimated limits of detection (LOD) were 0.106, 0.047 and 0.079 mg/L for TAR, CR, and AB, respectively, and limits of quantification (LOQ) were 0.355, 0.157 and 0.265 mg/L for TAR, CR, and AB, respectively. Quantitative determination of the three azo dyes was performed following optimized adsorption experiments onto chitosan beads of mixtures of TAR, CR and AB. Adsorption isotherm and kinetic studies were carried out, proving that the proposed PLSR method is rapid, accurate and reliable.
In this work, a multivariate detection limits (MDL) estimator was obtained based on micro-electro-mechanical systems–near infrared (MEMS–NIR) technology coupled with two sampling accessories to assess the detection capability for four quality parameters (glycyrrhizic acid, liquiritin, liquiritigenin and isoliquiritin) in licorice from different geographical regions. A total of 112 licorice samples were divided into two parts (calibration set and prediction set) using the Kennard–Stone (KS) method. The four quality parameters were measured by high-performance liquid chromatography (HPLC) according to the Chinese pharmacopoeia and previous studies. The MEMS–NIR spectra were acquired with a fiber optic probe (FOP) and an integrating sphere, and the partial least squares (PLS) model was then obtained using the optimum processing method. Chemometrics indicators were used to assess the PLS model performance. Model assessment using chemometrics indicators is based on the relative mean prediction error over all concentration levels, which indicated relatively low sensitivity for low-content analytes (below 1000 parts per million (ppm)). Therefore, the MDL estimator was introduced with alpha error and beta error, based on its good prediction characteristics at low concentration levels. The result suggested that MEMS–NIR technology coupled with the FOP and the integrating sphere was able to detect minor analytes. The result further demonstrated that the integrating sphere mode (i.e., MDL_{0.05,0.05} = 0.22%) was more robust than the FOP mode (i.e., MDL_{0.05,0.05} = 0.48%). In conclusion, this research proposed that the MDL method was helpful in determining the detection capabilities for low-content analytes using MEMS–NIR technology and successful in comparing the two sampling accessories.
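The Kennard-Stone split mentioned above selects calibration samples that cover the measurement space uniformly: it starts from the two most distant samples and then repeatedly adds the sample whose nearest already-selected neighbour is farthest away. The following is a minimal sketch of that rule, assuming squared Euclidean distance on raw features and invented one-dimensional data; the study's own preprocessing and split ratio are not given in the abstract.

```python
# Kennard-Stone selection of a calibration set (max-min distance rule).
# Assumptions: squared Euclidean distance, n_select >= 2, made-up data.

def kennard_stone(X, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone rule."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    n = len(X)
    # Start with the two samples that are farthest apart.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda ij: dist2(X[ij[0]], X[ij[1]]))
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select and remaining:
        # Add the candidate whose nearest selected sample is farthest away.
        best = max(remaining,
                   key=lambda k: min(dist2(X[k], X[s]) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical samples described by a single feature.
X = [[0.0], [1.0], [2.0], [5.0], [10.0]]
calibration_idx = kennard_stone(X, 3)
prediction_idx = [k for k in range(len(X)) if k not in calibration_idx]
```

The samples left over after selection form the prediction set, so the prediction samples always lie within the range spanned by the calibration samples.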
In this study, two functional logistic regression models with functional principal component basis (FPCA) and functional partial least squares basis (FPLS) have been developed to distinguish precancerous adenomatous polyps from hyperplastic polyps for the purpose of classification and interpretation. The classification performances of the two functional models have been compared with two widely used multivariate methods, principal component discriminant analysis (PCDA) and partial least squares discriminant analysis (PLSDA). The results indicated that the FPCA and FPLS models outperformed the PCDA and PLSDA models in classification ability while using a small number of functional basis components. With a substantial reduction in model complexity and an improvement in classification accuracy, the functional approach is particularly helpful for interpreting the complex spectral features related to precancerous colon polyps.
This study evaluates the operational performance of all routes of Sajha Bus Yatayat operating inside the Kathmandu valley using Data Envelopment Analysis (DEA) in terms of efficiency and effectiveness scores. This approach allows us to assess the relative performance of a transit system in the absence of historical data and research to compare with. To explore the possibility of enhancing performance, scenarios were created for relatively underperforming routes and for the long route problem by changing the most important input variable and adjusting the output variables accordingly with a regression model where relevant. Partial Least Squares (PLS) regression was used to determine the input variables with the most influence on the output variables. DEA was conducted to assess the performance of all routes under these scenarios. Under the first set of scenarios, the underperforming routes, except the longest route, become more efficient without considerable negative deviation in effectiveness. The result of the second set of scenarios, for the long route problem, suggests that the longest route's performance can be enhanced significantly with proper route alignment. Scenario development and evaluation can help transit companies explore strategies for enhancing operational performance.
The objective of this paper is to present a review of different calibration and classification methods for functional data in the context of chemometric applications. In chemometrics, it is usual to measure certain parameters in terms of a set of spectrometric curves that are observed at a finite set of points (functional data). Although the predictor variable is clearly functional, this problem is usually solved by using multivariate calibration techniques that consider it as a finite set of variables associated with the observed points (wavelengths or times). But these explanatory variables are highly correlated, and it is therefore more informative to first reconstruct the true functional form of the predictor curves. Although several articles have been published on the implementation of functional data analysis techniques in chemometrics, their power to solve real problems is not yet well known. For this reason, the extension of multivariate calibration techniques (linear regression, principal component regression and partial least squares) and classification methods (linear discriminant analysis and logistic regression) to the functional domain, and some relevant chemometric applications, are reviewed in this paper.
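Objective: To establish a high-performance liquid chromatography quantitative analysis of multi-components by single marker (HPLC-QAMS) method for the simultaneous determination of syringin, chlorogenic acid, sinapaldehyde glucoside, coniferyl alcohol, rutin, kaempferol-3-O-rutinoside, 3,4-O-dicaffeoylquinic acid, 3,5-O-dicaffeoylquinic acid and 4,5-O-dicaffeoylquinic acid in the She ethnic medicine Shushen (树参, Dendropanax dentiger), and to evaluate its quality comprehensively using multivariate statistical analysis and the weighted technique for order preference by similarity to ideal solution (TOPSIS) method. Methods: Separation was performed on a Waters Xbridge C18 column with acetonitrile-0.05% formic acid solution as the mobile phase under gradient elution; the detection wavelength was 260 nm. With kaempferol-3-O-rutinoside as the internal reference, relative correction factors (RCFs) between the reference and the other 8 analytes were established, the robustness of the RCFs was examined and the chromatographic peaks were located. The results were compared with those measured by the external standard method to verify the accuracy and reliability of the HPLC-QAMS method. Principal component analysis (PCA), orthogonal partial least squares-discriminant analysis (OPLS-DA) and other multivariate statistical analyses, together with the weighted TOPSIS (W-TOPSIS) method, were used to analyze the correlations among the HPLC-QAMS content results for the 9 components, to identify the main potential markers affecting the quality of Shushen products, and to establish a comprehensive quality evaluation method for Shushen. Results: The 9 components showed good linearity in the ranges of 3.27~81.75 μg/mL, 9.85~246.25 μg/mL, 0.43~0.75 μg/mL, 0.31~7.75 μg/mL, 1.58~39.50 μg/mL, 0.59~14.75 μg/mL, 1.26~31.50 μg/mL, 4.55~113.75 μg/mL and 1.98~49.50 μg/mL, respectively, with average spike recoveries of 96.82%~100.07% (RSD<2.0%). There was no statistically significant difference between the content results of HPLC-QAMS and the external standard method (ESM) (P>0.05), so the HPLC-QAMS method can be used for multi-component quantitative control of Shushen. The multivariate statistical analysis showed that the cumulative variance contribution rate of the first 2 principal components was 89.589%, and that chlorogenic acid, syringin, 3,5-O-dicaffeoylquinic acid and 4,5-O-dicaffeoylquinic acid were the main potential markers affecting product quality. The weighted TOPSIS results showed that Shushen from Zhejiang was of the best quality, followed by that from Jiangxi, Anhui, Hunan and Hubei, while Shushen from Yunnan and Guizhou ranked in the bottom 4. Conclusion: The established HPLC-QAMS method for multi-component quantitative control is convenient to operate and gives accurate results; multivariate statistical analysis combined with the weighted TOPSIS method is comprehensive and objective, and can be used for the comprehensive quality evaluation of Shushen.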
Funding (transgenic rice NIRS study): Supported by projects under the Innovation Team of the Safety Standards and Testing Technology for Agricultural Products of Zhejiang Province, China (Grant No. 2010R50028) and the National Key Technologies R&D Program of China during the 11th Five-Year Plan Period (Grant No. 2006BAK02A18).
Funding (joint association analysis study): Supported by grants from the National Program on the Development of Basic Research (2011CB100100), the Priority Academic Program Development of Jiangsu Higher Education Institutions, the National Natural Science Foundations (31391632, 31200943, 31171187, and 91535103), the National High-tech R&D Program (863 Program) (2014AA10A601-5), the Natural Science Foundations of Jiangsu Province (BK20150010), the Natural Science Foundation of the Jiangsu Higher Education Institutions (14KJA210005), and the Innovative Research Team of Universities in Jiangsu Province (KYLX_1352).
Funding (GWAS SNP-set study): Funded by the National Natural Science Foundation of China (81202283, 81473070, 81373102 and 81202267), the Key Grant of the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (10KJA330034 and 11KJA330001), the Research Fund for the Doctoral Program of Higher Education of China (20113234110002), and the Priority Academic Program for the Development of Jiangsu Higher Education Institutions (Public Health and Preventive Medicine).
Funding (land use PLS study): National Natural Science Foundation of China, No. 40301038.
Funding (boosting PLS study): This work was supported by the National High-tech Research and Development Program of China (No. 2003AA412110).
Funding (LIBS coal analysis study): Financial support from the National Natural Science Foundation of China (No. 62205172), the Huaneng Group Science and Technology Research Project (No. HNKJ22-H105), the Tsinghua University Initiative Scientific Research Program, and the International Joint Mission on Climate Change and Carbon Neutrality.
Abstract: Laser-induced breakdown spectroscopy (LIBS) has become a widely used atomic spectroscopic technique for rapid coal analysis. However, the vast amount of spectral information in LIBS contains signal uncertainty, which can affect its quantification performance. In this work, we propose a hybrid variable selection method to improve the performance of LIBS quantification. Important variables are first identified using Pearson's correlation coefficient, mutual information, least absolute shrinkage and selection operator (LASSO) and random forest, and then filtered and combined with empirical variables related to fingerprint elements of coal ash content. Subsequently, these variables are fed into a partial least squares regression (PLSR). Additionally, in some models, certain variables unrelated to ash content are removed manually to study the impact of variable deselection on model performance. The proposed hybrid strategy was tested on three LIBS datasets for quantitative analysis of coal ash content and compared with the corresponding data-driven baseline method. It is significantly better than variable selection based only on empirical knowledge and in most cases outperforms the baseline method. On all three datasets, the hybrid strategy combining empirical knowledge and data-driven algorithms achieved the lowest root mean square error of prediction (RMSEP) values of 1.605, 3.478 and 1.647, respectively, which were significantly lower than those obtained from multiple linear regression using only 12 empirical variables (1.959, 3.718 and 2.181, respectively). The LASSO-PLSR model with empirical support and 20 selected variables exhibited a significantly improved performance after variable deselection, with RMSEP values dropping from 1.635, 3.962 and 1.647 to 1.483, 3.086 and 1.567, respectively. Such results demonstrate that using empirical knowledge as a support for data-driven variable selection can be a viable approach to improve the accuracy and reliability of LIBS quantification.
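The hybrid selection idea in this abstract (a data-driven ranking of variables, combined with empirically chosen variables, then PLSR scored by RMSEP) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation. A plain Pearson filter stands in for the several selectors the abstract names. `pls1_fit` is a hypothetical helper implementing the standard NIPALS PLS1 algorithm. The channel indices, the number of kept variables and the number of components are all assumptions.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit single-response PLS (PLS1) by NIPALS; returns (coef, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)          # weight vector
        t = Xk @ w                      # scores
        tt = t @ t
        p = Xk.T @ t / tt               # X loadings
        c = yk @ t / tt                 # y loading
        Xk = Xk - np.outer(t, p)        # deflate X
        yk = yk - c * t                 # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # B = W (P^T W)^-1 q
    return coef, x_mean, y_mean

def rmsep(y_true, y_pred):
    """Root mean square error of prediction."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic "spectra": 100 channels, three of which carry the response (assumed).
rng = np.random.default_rng(0)
n_samples, n_channels = 120, 100
informative = [20, 55, 80]
X = rng.normal(size=(n_samples, n_channels))
y = X[:, informative] @ np.array([2.0, -1.5, 1.0]) + 0.1 * rng.normal(size=n_samples)
X_train, y_train, X_test, y_test = X[:80], y[:80], X[80:], y[80:]

# Data-driven step: rank channels by |Pearson r| on the training set, keep the top 10.
Xc, yc = X_train - X_train.mean(axis=0), y_train - y_train.mean()
r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
data_driven = np.argsort(-np.abs(r))[:10]

# Empirical step: always include channels chosen from domain knowledge (hypothetical).
empirical = [55, 80]
selected = np.union1d(data_driven, empirical)

coef, x_mean, y_mean = pls1_fit(X_train[:, selected], y_train, n_components=3)
y_pred = (X_test[:, selected] - x_mean) @ coef + y_mean
error = rmsep(y_test, y_pred)
print(f"{len(selected)} variables selected, RMSEP = {error:.3f}")
```

Taking the union with the empirical list guarantees that domain-relevant channels reach the regression even when the data-driven ranking misses them.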
Funding: the National Natural Science Foundation of China (41101395, 41071276, 31071324); the Beijing Municipal Natural Science Foundation, China (4122032); the National Basic Research Program of China (2011CB311806).
Abstract: Powdery mildew (Blumeria graminis) is one of the most destructive crop diseases infecting winter wheat plants, and has devastated millions of hectares of farmland in China. The objective of this study is to detect the damage caused by powdery mildew at the leaf level by means of hyperspectral measurements, particularly using continuous wavelet analysis. In May 2010, the reflectance spectra and the biochemical properties were measured for 114 leaf samples with various degrees of disease severity. A hyperspectral imaging system was also employed to obtain detailed hyperspectral information on the normal and the pustule areas within one diseased leaf. Based on these spectral data, a continuous wavelet analysis (CWA) was carried out in conjunction with a correlation analysis, which generated a so-called correlation scalogram that summarizes the correlations between disease severity and the wavelet power at different wavelengths and decomposition scales. By using a thresholding approach, seven wavelet features were isolated for developing models to determine disease severity. In addition, 22 conventional spectral features (SFs) were also tested and compared with the wavelet features for their efficiency in estimating disease severity. Multivariate linear regression (MLR) analysis and partial least squares regression (PLSR) analysis were adopted as training methods in model development. The spectral characteristics of powdery mildew at the leaf level were found to be closely related to the spectral characteristics of the pustule area and the chlorophyll content. The wavelet features performed better than the conventional SFs in capturing this spectral change. Moreover, the regression model composed of seven wavelet features outperformed (R² = 0.77, relative root mean square error RRMSE = 0.28) the model composed of 14 optimal conventional SFs (R² = 0.69, RRMSE = 0.32) in estimating disease severity. The PLSR method yielded a higher accuracy than the MLR method. A combination of CWA and PLSR was found to be promising in providing relatively accurate estimates of the disease severity of powdery mildew at the leaf level.
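The correlation scalogram mentioned in this abstract can be illustrated with a short sketch: each spectrum is decomposed with a continuous wavelet transform, and at every (scale, wavelength) cell the wavelet power across samples is correlated with disease severity. The code below is an assumption-laden illustration rather than the study's procedure. The Mexican hat mother wavelet, the use of the squared coefficient as "wavelet power", the scales, the top-5% threshold and the synthetic spectra are all choices made here for demonstration.

```python
import numpy as np

def mexican_hat(scale, half_width=4):
    """Sampled Mexican hat (second derivative of a Gaussian) wavelet at one scale."""
    x = np.arange(-half_width * scale, half_width * scale + 1) / scale
    return (1 - x**2) * np.exp(-x**2 / 2) / np.sqrt(scale)

def correlation_scalogram(spectra, severity, scales):
    """R^2 between wavelet power and severity at each (scale, wavelength) cell."""
    r2 = np.zeros((len(scales), spectra.shape[1]))
    sc = severity - severity.mean()
    for i, s in enumerate(scales):
        psi = mexican_hat(s)
        # Wavelet power, taken here as the squared coefficient (an assumption).
        power = np.array([np.convolve(row, psi, mode="same") for row in spectra]) ** 2
        pc = power - power.mean(axis=0)
        num = (pc * sc[:, None]).sum(axis=0)
        den = np.sqrt((pc**2).sum(axis=0) * (sc**2).sum())
        r2[i] = (num / den) ** 2
    return r2

# Synthetic leaf spectra: an absorption dip at band 150 that weakens with severity.
rng = np.random.default_rng(1)
bands = np.arange(300)
severity = rng.uniform(0, 1, size=60)
depth = 0.05 + 0.25 * (1 - severity)
dip = np.exp(-((bands - 150) ** 2) / (2 * 8.0**2))
spectra = 0.5 - depth[:, None] * dip[None, :] + 0.005 * rng.normal(size=(60, 300))

scales = [2, 4, 8, 16]
r2 = correlation_scalogram(spectra, severity, scales)

# Thresholding step: keep the cells with the top 5% of R^2 as candidate wavelet features.
feature_mask = r2 >= np.percentile(r2, 95)
print("candidate feature cells:", int(feature_mask.sum()))
```

In this toy setting the high-R² cells cluster around the simulated absorption feature, which is the behaviour the thresholding step exploits to isolate a small set of wavelet features.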
Funding: Supported by the National High-Tech Development Program of China (No. 863-511-920-011, 2001AA411230).
Abstract: Multivariate statistical process monitoring and control (MSPM&C) methods for chemical process monitoring with statistical projection techniques such as principal component analysis (PCA) and partial least squares (PLS) are surveyed in this paper. The four-step procedure for performing MSPM&C on a chemical process (modeling the process, detecting abnormal events or faults, identifying the variable(s) responsible for the faults, and diagnosing the source cause of the abnormal behavior) is analyzed. Several main research directions of MSPM&C reported in the literature are discussed, such as multi-way principal component analysis (MPCA) for batch processes, statistical monitoring and control for nonlinear processes, dynamic PCA and dynamic PLS, and on-line quality control by inferential models. Industrial applications of MSPM&C to several typical chemical processes, such as chemical reactors, distillation columns, polymerization processes and petroleum refinery units, are summarized. Finally, some concluding remarks and future considerations are made.
Funding: Ministero dell'Istruzione, dell'Università e della Ricerca (MIUR), Italy, for the financial support to this work, grant 60% 2014.
Abstract: The performance of different chemometric approaches was evaluated in the spectrophotometric determination of pharmaceutical mixtures whose components are present in very different amounts. Principal component regression (PCR), partial least squares with one dependent variable (PLS1) or multiple dependent variables (PLS2), and multivariate curve resolution (MCR) were applied to the spectral data of a ternary mixture containing paracetamol, sodium ascorbate and chlorpheniramine (150:140:1, m/m/m), and a quaternary mixture containing paracetamol, caffeine, phenylephrine and chlorpheniramine (125:6.25:1.25:1, m/m/m/m). The UV spectra of the calibration samples in the range of 200-320 nm were pre-treated by removing noise and useless data, and the wavelength regions having the most useful analytical information were selected using the regression coefficients calculated in the multivariate modeling. All the defined chemometric models were validated on external sample sets and then applied to commercial pharmaceutical formulations. Different data intervals, fixed at 0.5, 1.0, and 2.0 point/nm, were tested to optimize the prediction ability of the models. The best results were obtained using the PLS1 calibration models, and the quantification of the species present in lower amounts was significantly improved by adopting the 0.5 data interval, which showed accuracy between 94.24% and 107.76%.
Abstract: A simple and rapid analytical method for the simultaneous quantification of three commercial azo dyes, Tartrazine (TAR), Congo Red (CR), and Amido Black (AB), in water is presented. Determining the individual concentration of each organic dye in a mixture by a spectrophotometric method is difficult in analytical chemistry because of spectral overlapping. This drawback can be overcome if a multivariate calibration method such as Partial Least Squares Regression (PLSR) is used. This study presents a calibration model based on absorption spectra in the 300-650 nm range for a set of 20 different mixtures of dyes, followed by the prediction of the concentrations of dyes in 6 randomly selected validation mixtures using the PLSR method. Estimated limits of detection (LOD) were 0.106, 0.047 and 0.079 mg/L for TAR, CR, and AB, respectively, and limits of quantification (LOQ) were 0.355, 0.157 and 0.265 mg/L for TAR, CR, and AB, respectively. Quantitative determination of the three azo dyes was performed following optimized adsorption experiments of mixtures of TAR, CR and AB onto chitosan beads. Adsorption isotherm and kinetic studies were carried out, proving that the proposed PLSR method is rapid, accurate and reliable.
Funding: This work was financially supported by the National Natural Science Foundation of China (81303218), the Doctoral Fund of China (20130013120006), and the Special Fund of Outstanding Young Teachers and Innovation Team.
Abstract: In this work, a multivariate detection limits (MDL) estimator was obtained based on micro-electro-mechanical systems–near infrared (MEMS–NIR) technology coupled with two sampling accessories to assess the detection capability for four quality parameters (glycyrrhizic acid, liquiritin, liquiritigenin and isoliquiritin) in licorice from different geographical regions. A total of 112 licorice samples were divided into two parts (calibration set and prediction set) using the Kennard–Stone (KS) method. The four quality parameters were measured using a high-performance liquid chromatography (HPLC) method according to the Chinese Pharmacopoeia and previous studies. The MEMS–NIR spectra were acquired with a fiber optic probe (FOP) and an integrating sphere, and the partial least squares (PLS) model was then obtained using the optimum processing method. Chemometrics indicators were used to assess the PLS model performance. Model assessment using chemometrics indicators is based on the relative mean prediction error over all concentration levels, which indicated relatively low sensitivity for low-content analytes (below 1000 parts per million (ppm)). Therefore, the MDL estimator was introduced with alpha error and beta error, based on its good prediction characteristics at low concentration levels. The results suggested that MEMS–NIR technology coupled with the fiber optic probe (FOP) and the integrating sphere was able to detect minor analytes. The results further demonstrated that the integrating sphere mode (i.e., MDL_{0.05,0.05}, 0.22%) was more robust than the FOP mode (i.e., MDL_{0.05,0.05}, 0.48%). In conclusion, this research proposed that the MDL method was helpful for determining the detection capabilities for low-content analytes using MEMS–NIR technology and was successful in comparing the two sampling accessories.
Abstract: In this study, two functional logistic regression models, one with a functional principal component basis (FPCA) and one with a functional partial least squares basis (FPLS), have been developed to distinguish precancerous adenomatous polyps from hyperplastic polyps for the purpose of classification and interpretation. The classification performances of the two functional models have been compared with two widely used multivariate methods, principal component discriminant analysis (PCDA) and partial least squares discriminant analysis (PLSDA). The results indicated that the FPCA and FPLS models outperformed the PCDA and PLSDA models in classification ability while using a small number of functional basis components. With a substantial reduction in model complexity and an improvement in classification accuracy, the functional approach is particularly helpful for interpreting the complex spectral features related to precancerous colon polyps.
Abstract: This study evaluates the operational performance of all routes of Sajha Bus Yatayat operating inside the Kathmandu valley using Data Envelopment Analysis (DEA) in terms of efficiency and effectiveness scores. This approach allows us to assess the relative performance of a transit system in the absence of historical data and prior research for comparison. To explore the possibility of enhancing performance, scenarios were created for the relatively underperforming routes and for the long route problem by changing the most important input variable and adjusting the output variables accordingly, using a regression model where relevant. Partial Least Squares (PLS) regression was used to determine which input variables most influence the output variables. DEA was conducted to assess the performance of all routes under these scenarios. Under the first set of scenarios, the underperforming routes, except the longest route, become more efficient without a considerable negative deviation in effectiveness. The results of the second set of scenarios, for the long route problem, suggest that the longest route's performance can be enhanced significantly with proper route alignment. Scenario development and evaluation can help transit companies explore strategies for enhancing operational performance.
Abstract: The objective of this paper is to present a review of different calibration and classification methods for functional data in the context of chemometric applications. In chemometrics, it is usual to measure certain parameters in terms of a set of spectrometric curves that are observed at a finite set of points (functional data). Although the predictor variable is clearly functional, this problem is usually solved by using multivariate calibration techniques that treat it as a finite set of variables associated with the observed points (wavelengths or times). But these explanatory variables are highly correlated, and it is therefore more informative to first reconstruct the true functional form of the predictor curves. Although several articles have been published on the implementation of functional data analysis techniques in chemometrics, their power to solve real problems is not yet well known. For this reason, the extension of multivariate calibration techniques (linear regression, principal component regression and partial least squares) and classification methods (linear discriminant analysis and logistic regression) to the functional domain, together with some relevant chemometric applications, is reviewed in this paper.
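Abstract: Objective: To establish a high-performance liquid chromatography quantitative analysis of multi-components by single marker (HPLC-QAMS) method for the simultaneous determination of syringin, chlorogenic acid, sinapaldehyde glucoside, coniferyl alcohol, rutin, kaempferol-3-O-rutinoside, 3,4-O-dicaffeoylquinic acid, 3,5-O-dicaffeoylquinic acid and 4,5-O-dicaffeoylquinic acid in the She ethnic medicine Shushen (树参), and to comprehensively evaluate its quality using multivariate statistical analysis and the weighted technique for order preference by similarity to ideal solution (TOPSIS) method. Methods: A Waters Xbridge C18 column was used with acetonitrile-0.05% formic acid solution as the mobile phase under gradient elution; the detection wavelength was 260 nm. With kaempferol-3-O-rutinoside as the reference, relative correction factors (RCFs) between the internal reference and the other eight analytes were established, RCF robustness was examined and chromatographic peaks were located; the results were also compared with those measured by the external standard method to verify the accuracy and reliability of the HPLC-QAMS method. Multivariate statistical analyses such as principal component analysis (PCA) and orthogonal partial least squares-discriminant analysis (OPLS-DA), together with the W-TOPSIS method, were used to analyze the correlations among the HPLC-QAMS content results of the nine components, to identify the main potential markers affecting the product quality of Shushen, and to establish a comprehensive quality evaluation method for Shushen. Results: The nine components showed good linearity in the ranges of 3.27-81.75 μg/mL, 9.85-246.25 μg/mL, 0.43-0.75 μg/mL, 0.31-7.75 μg/mL, 1.58-39.50 μg/mL, 0.59-14.75 μg/mL, 1.26-31.50 μg/mL, 4.55-113.75 μg/mL and 1.98-49.50 μg/mL, respectively, with average spike recoveries of 96.82%-100.07% (RSD < 2.0%). There was no statistically significant difference between the contents determined by HPLC-QAMS and by the external standard method (ESM) (P > 0.05), so the HPLC-QAMS method can be used for multi-component quantitative control of Shushen. The multivariate statistical analysis showed that the first two principal components had a cumulative variance contribution rate of 89.589%, and that chlorogenic acid, syringin, 3,5-O-dicaffeoylquinic acid and 4,5-O-dicaffeoylquinic acid were the main potential markers affecting the product quality of Shushen. The weighted TOPSIS results showed that Shushen from Zhejiang was of the best quality, followed by that from Jiangxi, Anhui, Hunan and Hubei, while Shushen from Yunnan and Guizhou occupied the last four positions in the ranking. Conclusion: The established HPLC-QAMS multi-component quantitative control method is convenient to operate and gives accurate results; multivariate statistical analysis combined with the weighted TOPSIS method is comprehensive and objective, and can be used for the comprehensive quality evaluation of Shushen.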
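The weighted TOPSIS ranking used in this abstract can be sketched in a few lines. The version below is a generic textbook formulation (vector normalization, weighting, Euclidean distances to the ideal and anti-ideal solutions), not the paper's exact procedure. The function name `weighted_topsis`, the content values, the weights and the batch labels are hypothetical and serve only to show the mechanics.

```python
import numpy as np

def weighted_topsis(matrix, weights, benefit):
    """Return the TOPSIS relative closeness (0..1, higher is better) for each row."""
    X = np.asarray(matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # weights sum to 1
    V = w * X / np.linalg.norm(X, axis=0)             # vector-normalize, then weight
    benefit = np.asarray(benefit, dtype=bool)
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    anti = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_pos = np.linalg.norm(V - ideal, axis=1)         # distance to ideal solution
    d_neg = np.linalg.norm(V - anti, axis=1)          # distance to anti-ideal solution
    return d_neg / (d_pos + d_neg)

# Hypothetical contents (rows: batches A-D, columns: three marker compounds).
batches = ["A", "B", "C", "D"]
contents = [
    [10.0, 8.0, 6.0],
    [7.0, 8.0, 4.0],
    [5.0, 3.0, 5.0],
    [2.0, 1.0, 1.0],
]
weights = [0.5, 0.3, 0.2]          # assumed importance of each compound
benefit = [True, True, True]       # higher content counts as better for every column

scores = weighted_topsis(contents, weights, benefit)
ranking = [batches[i] for i in np.argsort(-scores)]
print("ranking (best first):", ranking)
```

In practice the weights would come from an objective scheme chosen by the analyst, and columns where a lower value is preferable would be flagged `False` in `benefit`.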