Compositional data, which carry relative information, are an important part of machine learning and related fields. They are typically recorded as closed data, that is, as parts that sum to a constant such as 100%. The linear regression model is among the most widely used statistical techniques for identifying relationships between variables of interest, and it appears in a wide range of applications. Maximum likelihood estimation (MLE) is the method of choice for estimating linear regression parameters, which are needed for tasks such as prediction and the analysis of partial effects of independent variables. However, data quality is a significant challenge in machine learning, especially when data are missing: many datasets contain missing observations, and recovering them can be costly and time-consuming. The expectation-maximization (EM) algorithm has been suggested as a solution for such situations. The EM algorithm iteratively finds maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that depend on unobserved variables or data. The expectation (E) step constructs the expected log-likelihood, evaluated using the current parameter estimate. The maximization (M) step then finds the parameters that maximize the expected log-likelihood obtained in the E step. This study examined how well the EM algorithm performed on a simulated compositional dataset with missing observations, using both ordinary least squares and robust least squares regression. The efficacy of the EM algorithm was compared with two alternative imputation techniques, k-Nearest Neighbor (k-NN) and mean imputation, in terms of Aitchison distances and covariance.
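The E and M steps described above can be illustrated with a minimal sketch. It assumes a simpler setting than the study: an ordinary (non-compositional) linear regression with Gaussian errors in which only some responses are missing. The function name `em_linear_regression` and the simulated data are hypothetical, and the code is not the authors' implementation.

```python
import numpy as np

def em_linear_regression(X, y, n_iter=200, tol=1e-10):
    """EM for y = X @ beta + N(0, sigma2) noise when some entries of y are NaN.

    Assumption for this sketch: only responses are missing, never covariates.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    miss = np.isnan(y)
    n = len(y)
    # Starting values: fit on mean-filled responses (the mean-imputation baseline).
    y_fill = np.where(miss, np.nanmean(y), y)
    beta = np.linalg.lstsq(X, y_fill, rcond=None)[0]
    sigma2 = np.var(y[~miss])
    for _ in range(n_iter):
        # E step: replace each missing response by its expectation under the
        # current estimate, E[y_i | x_i] = x_i @ beta.
        y_hat = np.where(miss, X @ beta, y)
        # M step: maximize the expected log-likelihood, i.e. OLS on completed data.
        beta_new = np.linalg.lstsq(X, y_hat, rcond=None)[0]
        resid = y_hat - X @ beta_new
        # Each missing response also contributes its conditional variance sigma2.
        sigma2_new = (resid @ resid + miss.sum() * sigma2) / n
        converged = (np.max(np.abs(beta_new - beta)) < tol
                     and abs(sigma2_new - sigma2) < tol)
        beta, sigma2 = beta_new, sigma2_new
        if converged:
            break
    return beta, sigma2

# Hypothetical simulated data: intercept 1, slope 2, about 30% missing responses.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)
y[rng.random(n) < 0.3] = np.nan

beta_em, sigma2_em = em_linear_regression(X, y)
print(beta_em, sigma2_em)
```

In this simplified setting the EM fixed point coincides with least squares on the complete cases, which gives a convenient check on the loop. Missing parts of a composition, as in the study, require more than this sketch provides.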
A class of pseudo distances is used to derive test statistics, based on transformed data or spacings, for testing goodness-of-fit of parametric models. These statistics can be regarded as density-based statistics and are expressible as simple functions of spacings. It is known that when the null hypothesis is simple, the statistics follow asymptotic normal distributions without unknown parameters. In this paper we emphasize results for the composite null hypothesis. The parameters are first estimated by a generalized spacing method (GSP), which is equivalent to minimizing a pseudo distance from the class considered; the estimates then replace the parameters in the pseudo distance used for estimation. The resulting goodness-of-fit statistics for the composite hypothesis are again shown to have an asymptotic normal distribution without unknown parameters. Since these statistics are related to a discrepancy measure, the tests can be shown to be consistent in general. Furthermore, because the statistics are simple and come at no extra cost after fitting the model, they can be considered as alternatives to chi-square statistics, which require a choice of intervals, and to statistics based on the empirical distribution function (EDF) of the original data, whose complicated null distribution may depend on the parametric family considered and on the vector of true parameters. EDF tests, however, might be more powerful against some specific models specified by the alternative hypothesis.
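To make the idea of a statistic that is a simple function of spacings concrete, here is a minimal sketch for the simple null hypothesis only. It uses the familiar log-spacings (Moran-type) statistic as a stand-in; whether this statistic belongs to the paper's class of pseudo distances is an assumption. The normalization uses only leading-order asymptotic constants, and the names `spacings` and `moran_z` are hypothetical.

```python
import math
import numpy as np

def spacings(u):
    """Spacings D_1..D_{n+1} of values u in (0, 1), with end points 0 and 1 added."""
    u = np.sort(np.asarray(u, dtype=float))
    return np.diff(np.concatenate(([0.0], u, [1.0])))

def moran_z(x, cdf):
    """Standardized log-spacings statistic for a fully specified CDF (simple null).

    Leading-order constants only: (n+1) * D_i behaves like a unit exponential,
    so -log((n+1) * D_i) has mean equal to Euler's constant, and the sum has
    variance of about (n+1) * (pi^2 / 6 - 1).
    """
    d = spacings(cdf(np.asarray(x, dtype=float)))  # transform the data, then take spacings
    m = len(d)                                     # m = n + 1 spacings
    stat = -np.sum(np.log(m * d))
    mean = m * np.euler_gamma
    var = m * (math.pi ** 2 / 6.0 - 1.0)
    return (stat - mean) / math.sqrt(var)

# Hypothetical example: exponential data tested against the true exponential CDF.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=500)
z_null = moran_z(sample, lambda v: 1.0 - np.exp(-v / 2.0))
print(z_null)  # compare with standard normal quantiles
```

Large positive values point away from the hypothesized model. The sketch does not cover the composite case emphasized in the abstract, where the parameters are estimated first.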
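[Objective] To study the ecological pattern of the Ordos region and its natural evolution under global change, and to reveal the spatiotemporal changes of the artificially disturbed ecological environment in mining areas of western China. [Methods] Using the GIMMS NDVI 3g dataset for 1982 to 2012 together with meteorological data such as annual mean temperature and precipitation, we applied maximum value compositing, inverse distance weighted interpolation, linear regression and rate-of-change analysis, and correlation analysis to reveal the plant physiological mechanisms underlying the spatiotemporal trends in vegetation cover and their response to trends in temperature and precipitation. [Results] In the Ordos region, vegetation green-up (start of season, SOS) begins in late April and senescence (end of season, EOS) ends in early November; the initial NDVI threshold of the growing season (duration of season, DOS) is 0.12, and the mean growing season lasts 198 d. Over the 31 years, the rate of change of vegetation greenness (slope) was 0.0023, and pixel-by-pixel regression analysis of the vegetation trend showed that 80.8% of the vegetation in the study area improved slightly. The correlations of NDVI change with annual mean temperature and precipitation were 0.054 and 0.400, respectively. [Conclusion] Over the 31 years, green-up in the Ordos region tended to occur earlier, senescence tended to occur later, and the growing season tended to lengthen. Vegetation improved slightly over most of the study area. Annual mean temperature and precipitation both showed increasing trends. NDVI change was driven jointly by temperature and precipitation: the increase in maximum NDVI was more strongly correlated with the increase in annual precipitation and only weakly correlated with the rise in annual mean temperature.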
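The processing chain named in the methods (maximum value compositing, per-pixel linear regression, correlation analysis) can be sketched as follows. This is a minimal illustration on synthetic data, not the study's workflow: the array shapes, the function names, and the simulated NDVI and precipitation values are all hypothetical.

```python
import numpy as np

def annual_max_composite(monthly):
    """Maximum value composite: (years, 12, rows, cols) -> (years, rows, cols)."""
    return monthly.max(axis=1)

def pixel_trend(stack, years):
    """Least-squares slope of every pixel's time series; stack is (time, rows, cols)."""
    t = np.asarray(years, dtype=float)
    t_c = t - t.mean()                      # centered time
    y_c = stack - stack.mean(axis=0)        # centered values, per pixel
    return np.tensordot(t_c, y_c, axes=(0, 0)) / np.sum(t_c ** 2)

def pearson_r(a, b):
    """Pearson correlation between two 1-D series."""
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical synthetic data: 31 years (1982 to 2012) on a 4 x 5 pixel grid.
rng = np.random.default_rng(1)
years = np.arange(1982, 2013)
monthly = (0.2 + 0.0023 * (years - 1982)[:, None, None, None]
           + rng.normal(scale=0.01, size=(31, 12, 4, 5)))

annual = annual_max_composite(monthly)
slope = pixel_trend(annual, years)               # NDVI change per year, per pixel
improved_share = float((slope > 0).mean())       # fraction of pixels with a positive trend

regional_ndvi = annual.mean(axis=(1, 2))
precip = 300 + 2.0 * (years - 1982) + rng.normal(scale=30, size=31)  # made-up series
r_precip = pearson_r(regional_ndvi, precip)
print(slope.mean(), improved_share, r_precip)
```

Computing the slope with centered sums avoids a Python loop over pixels, which matters when the grid is large.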
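Eighteen samples from four ecotypes of reed (Phragmites communis Trin.) in Linze County, Gansu Province, were used as material. The matK/trnK sequence was amplified by PCR with the universal primers "matK-FF74" and "matK-trnK-2R", and the purified products were sequenced and analyzed. Sequences were aligned with Clustal W, and PAUP4b10 was used for phylogenetic analysis, to construct MP and ML trees and to calculate genetic distances, with giant reed (Arundo donax) as the outgroup. The results showed that the matK/trnK sequences were 1745~1753 bp long and contained 91 parsimony-informative sites and 120 variable sites, and that three strict consensus trees were obtained. The MP and ML trees together indicate that aquatic reed is the most ancient group and is genetically most distant from dune reed, while heavily salinized transitional reed and lightly salinized transitional reed are intermediate transitional groups.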
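The site counts and genetic distances reported above were produced with PAUP4b10. As a rough illustration of what those quantities mean, the sketch below counts variable and parsimony-informative sites in a toy alignment and computes an uncorrected pairwise distance (p-distance). The function names and the toy sequences are hypothetical, gaps and ambiguous bases are simply ignored, and PAUP's own distance options may differ.

```python
from collections import Counter

BASES = "ACGT"

def site_summary(alignment):
    """Return (variable_sites, parsimony_informative_sites) for aligned sequences.

    A site is variable if it shows at least two different bases; it is
    parsimony-informative if at least two bases each occur in at least two sequences.
    """
    variable = informative = 0
    for column in zip(*alignment):
        counts = Counter(b for b in column if b in BASES)
        if len(counts) >= 2:
            variable += 1
            if sum(1 for c in counts.values() if c >= 2) >= 2:
                informative += 1
    return variable, informative

def p_distance(a, b):
    """Proportion of differing sites among positions where both sequences have a base."""
    pairs = [(x, y) for x, y in zip(a, b) if x in BASES and y in BASES]
    return sum(x != y for x, y in pairs) / len(pairs)

# Toy alignment (made up, not reed matK/trnK data).
toy = ["ACGTA", "ACGTT", "ACCTT", "ATCTA"]
print(site_summary(toy))          # (3, 2)
print(p_distance(toy[0], toy[1])) # 0.2
```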