To address the degradation in detection performance caused by a dynamically changing environment in ultra-wideband (UWB) multiple-input multiple-output (MIMO) radar, this paper proposes a novel adaptive waveform design that aims to improve the discrimination of targets from clutter in the radar scene. First, a sequence of Morlet wavelet pulses with frequency hopping and pulse position modulation by a Welch-Costas array is designed. Then a waveform optimization solution is proposed, based on a strategy of minimizing mutual information (MI). Simulation results over successive iterations of the algorithm demonstrate that the optimized waveform design improves target detection in the presence of noise and clutter.
Multiple myeloma (MM) is a common malignant hematological disease. Dysregulation of microRNAs (miRNAs) in MM cells and the bone marrow microenvironment has important effects on the initiation and progression of MM and on drug resistance in MM cells. Recently, it was reported that serum and plasma from MM patients contain sufficiently stable miRNA signatures, and that circulating miRNAs can be identified and measured accurately in body fluids. Compared with conventional diagnostic parameters, the circulating miRNA profile is appropriate for diagnosing MM and estimates patient progression and therapeutic outcome with higher specificity and sensitivity. In this review, we focus mainly on the potential of circulating miRNAs as diagnostic, prognostic, and predictive biomarkers for MM, and summarize the general strategies and methodologies for identifying and measuring circulating miRNAs in various cancers. Furthermore, we discuss the correlation between circulating miRNAs and the cytogenetic abnormalities and biochemical parameters assessed in multiple myeloma.
Funding: This research was financially supported by FDCT (No. 005/2018/A1), by the Guangdong Provincial Innovation and Entrepreneurship Training Program (Project No. 201713719017), and by the College Students Innovation Training Program held by Guangdong University of Science and Technology (Nos. 1711034, 1711080, and 1711088).
Abstract: Based on a two-dimensional relation table, this paper studies the missing values in sample land price data for Shunde District, Foshan City. GeoDa software was used to eliminate insignificant factors by stepwise regression analysis; NORM software was used to construct the multiple imputation models; and the EM algorithm and the data augmentation algorithm were applied to fit multiple linear regression equations and construct five different imputed datasets. Statistical analysis was performed on the imputed datasets to calculate the mean and variance of each, and weights were determined according to their differences. Finally, the datasets were integrated to produce the imputed values. Three missing-data cases were tested: PRICE missing at a rate of 5%, PRICE missing at a rate of 10%, and both PRICE and CBD missing. In these cases, the proportion of imputed values closer to the true value for the new method versus traditional multiple imputation was 75% to 25%, 62.5% to 37.5%, and 100% to 0%, respectively. The new method therefore clearly outperformed traditional multiple imputation in these cases, and the missing values it estimates have reference value.
Abstract: Multiple imputation compensates for missing data by producing multiple datasets from a regression model, and is regarded as a solution to the long-standing problems of univariate imputation. Univariate imputation imputes data using only the specific column in which a cell is missing. Multivariate imputation works simultaneously with all variables in all columns, whether missing or observed, and has emerged as a principal method for solving missing data problems. All incomplete datasets analyzed before Multiple Imputation by Chained Equations (MICE) was introduced were misdiagnosed; the results obtained were invalid and should not be relied on to yield reasonable conclusions. This article highlights why multiple imputation is needed and how MICE works, with a particular focus on a cyber-security dataset. Removing missing data from any dataset and replacing it is imperative for analyzing the data and building prediction models. A good imputation technique should therefore recover the missing values, which involves extracting good features. However, the widely used univariate imputation method does not impute missing values reasonably if the values are too large, and may thus lead to bias. We therefore aim to propose an alternative imputation method that is efficient and removes potential bias after the missing values are filled in.
Funding: Supported by the Stiftung Rheinland-Pfalz für Innovation (959).
Abstract: A growing number of programs for multiple imputation of missing values are becoming available in statistical software. Two algorithms are mainly implemented: Expectation Maximization (EM) and Multiple Imputation by Chained Equations (MICE). They have been shown to work well in large samples or when only small proportions of missing data are to be imputed. However, some researchers have begun to impute large proportions of missing data or to apply the method to small samples. A simulation was performed using MICE on datasets with 50, 100 or 200 cases and four or eleven variables. A varying proportion of the data (3% to 63%) was set as missing completely at random and subsequently substituted using multiple imputation by chained equations. In a logistic regression model, four coefficients were examined: non-zero and zero main effects, and non-zero and zero interaction effects. Estimates of all main and interaction effects were unbiased. There was considerable variance in the estimates, increasing with the proportion of missing data and decreasing with sample size. Imputation by chained equations is a useful tool for imputing small to moderate proportions of missing data. The method has its limits, however: in small samples, there are considerable random errors for all effects.
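The simulation design described above starts by deleting a fixed proportion of cells completely at random (MCAR). The following is a minimal sketch of that masking step only, assuming a dataset stored as a list of rows; the function name `mask_mcar` and the 50-by-4 toy dataset are hypothetical and are not taken from the study, which does not describe its implementation.

```python
import random

def mask_mcar(rows, proportion, seed=0):
    """Return a copy of `rows` with `proportion` of all cells set to None.

    Cells are drawn uniformly at random, independent of any value,
    which is what "missing completely at random" (MCAR) means.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    n_missing = round(proportion * len(cells))
    masked = [list(row) for row in rows]  # copy so the input is not modified
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked

# Hypothetical toy dataset: 50 cases, 4 variables.
data = [[float(i + j) for j in range(4)] for i in range(50)]
masked = mask_mcar(data, 0.30)
n_missing = sum(value is None for row in masked for value in row)
```

With 200 cells and a proportion of 0.30, exactly 60 cells are masked; the imputation step itself would follow and is not shown here.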
Abstract: How many imputations are sufficient in multiple imputation? Answers given by different researchers vary from as few as 2 or 3 to as many as hundreds. Perhaps no single number of imputations fits all situations. In this study, η, the minimally sufficient number of imputations, was determined from the relationship between m, the number of imputations, and ω, the standard error of imputation variances, using the 2012 National Ambulatory Medical Care Survey (NAMCS) Physician Workflow mail survey. Five variables with various value ranges, variances, and missing data percentages were tested. For all variables tested, ω decreased as m increased. The value of m above which the cost of a further increase in m would outweigh the benefit of reducing ω was taken as η. This method can potentially be used by anyone to determine the η that fits their own data.
Funding: Funded by the National Natural Science Foundation of China under Grant Nos. 61972057 and U1836208; the Hunan Provincial Natural Science Foundation of China under Grant No. 2019JJ50655; the Scientific Research Foundation of Hunan Provincial Education Department of China under Grant No. 18B160; the Open Fund of Hunan Key Laboratory of Smart Roadway and Cooperative Vehicle Infrastructure Systems (Changsha University of Science and Technology) under Grant No. kfj180402; the "Double First-class" International Cooperation and Development Scientific Research Project of Changsha University of Science and Technology under Grant No. 2018IC25; and the Researchers Supporting Project No. (RSP-2020/102), King Saud University, Riyadh, Saudi Arabia.
Abstract: Multiple kernel clustering is an unsupervised data analysis method that has been used in various scenarios where data is easy to collect but hard to label. However, multiple kernel clustering for incomplete data is a critical yet challenging task. Although existing absent multiple kernel clustering methods have achieved remarkable performance on this task, they may fail when the data has a high value-missing rate, and they may easily fall into a local optimum. To address these problems, we propose an absent multiple kernel clustering (AMKC) method for incomplete data. The AMKC method first clusters the initialized incomplete data. Then it constructs a new multiple-kernel-based data space, referred to as K-space, from multiple sources to learn kernel combination coefficients. Finally, it seamlessly integrates an incomplete-kernel-imputation objective, a multiple-kernel-learning objective, and a kernel-clustering objective to achieve absent multiple kernel clustering. The three stages of this process are carried out simultaneously until the convergence condition is met. Experiments on six datasets with various characteristics demonstrate that the kernel imputation and clustering performance of the proposed method is significantly better than that of state-of-the-art competitors. The proposed method also converges quickly.
Abstract: Identifying genetic variants that contribute to phenotypic variation is expected to provide insights into the etiology of complex traits. Here we show how combining genetic mapping in an outbred population of rats with sequence data from the progenitors of the population made it possible to identify causal variants and genes for a large number of phenotypes. We identified 355 genomic loci contributing to 122 measures relevant to six models of disease, including fear-related behaviors and experimental autoimmune encephalomyelitis. At 35 of those loci we identified the responsible gene, and in some cases, the responsible variant.
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 11871287, 11501208, 11771144, and 11801359; the Natural Science Foundation of Tianjin under Grant No. 18JCYBJC41100; the Fundamental Research Funds for the Central Universities; and the Key Laboratory for Medical Data Analysis and Statistical Research of Tianjin.
Abstract: Empirical-likelihood-based inference for parameters defined by the general estimating equations of Qin and Lawless (1994) remains an active research topic. When the response is missing at random (MAR) and the dimension of the covariate is not low, the authors propose a two-stage estimation procedure that uses dimension-reduced kernel estimators in conjunction with an unbiased estimating function based on augmented inverse probability weighting and multiple imputation (AIPW-MI) methods. The authors show that the resulting estimator achieves consistency and asymptotic normality. In addition, the corresponding empirical likelihood ratio statistics asymptotically follow central chi-square distributions when evaluated at the true parameter. The finite-sample performance of the proposed estimator is studied through simulation, and an application to an HIV-CD4 data set is also presented.
Funding: Supported by the National Key Research and Development Project (Grant No. 2018YFC1900800-5), the National Natural Science Foundation of China (Grant Nos. 61890930-5, 61903010, 62021003 and 62125301), the Beijing Natural Science Foundation (Grant No. KZ202110005009), and the Beijing Outstanding Young Scientist Program (Grant No. BJJWZYJH 01201910005020).
Abstract: Due to sensor malfunctions and communication faults, multiple missing patterns frequently occur in the wastewater treatment process (WWTP). However, existing missing data imputation methods cannot handle multiple missing patterns because they do not make sufficient use of the information in the data. In this article, a double-cycle weighted imputation (DCWI) method is proposed to deal with multiple missing patterns by maximizing the use of the available information in variables and instances. The proposed DCWI comprises two components: a double-cycle-based imputation sorting and a weighted K nearest neighbor-based imputation estimator. First, the double-cycle mechanism, which combines missing variable sorting and missing instance sorting, is applied to direct the imputation of missing values. Second, the weighted K nearest neighbor-based imputation estimator is used to find globally similar instances and capture the volatility in the local region. The estimator preserves the original data characteristics as much as possible and enhances imputation accuracy. Finally, experimental results on simulated and real WWTP datasets with non-stationarity and nonlinearity demonstrate that the proposed DCWI produces more accurate imputation results than the comparison methods under different missing patterns and missing ratios.
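To illustrate the general idea behind a weighted K nearest neighbor imputation estimator, the sketch below fills one missing cell with an inverse-distance-weighted average of the values from the k most similar instances. It is a generic textbook-style example, not the DCWI estimator: the double-cycle sorting is omitted, and the function name `knn_impute`, the distance definition, and the weighting scheme are all assumptions made for illustration.

```python
import math

def knn_impute(rows, target_row, target_col, k=3):
    """Impute rows[target_row][target_col] from its k nearest neighbours.

    Distance is the root-mean-square difference over the columns that both
    rows have observed; neighbours are weighted by inverse distance.
    """
    target = rows[target_row]
    candidates = []
    for idx, row in enumerate(rows):
        if idx == target_row or row[target_col] is None:
            continue  # a neighbour must have the target column observed
        shared = [j for j in range(len(row))
                  if j != target_col
                  and row[j] is not None and target[j] is not None]
        if not shared:
            continue  # no common observed columns, distance undefined
        dist = math.sqrt(sum((row[j] - target[j]) ** 2 for j in shared)
                         / len(shared))
        candidates.append((dist, row[target_col]))
    if not candidates:
        raise ValueError("no usable neighbours for this cell")
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:k]
    weights = [1.0 / (dist + 1e-9) for dist, _ in nearest]  # avoid division by zero
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, nearest))
    return weighted_sum / sum(weights)

# Hypothetical toy data: the second column of the last row is missing.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [2.0, None]]
estimate = knn_impute(rows, target_row=3, target_col=1, k=3)
```

In this toy case the closest instance dominates the weighted average, so the estimate lies very close to 20.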
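Abstract: Objective: To study the performance of MICE (multivariate imputation by chained equations) multiple imputation under different missing rates and missing mechanisms, and to explore the situations in which this imputation method is applicable. Methods: Based on the complete data from a cross-sectional survey, R software was used to construct missing data with different missing rates and missing mechanisms. The standardized bias of the analysis results after listwise deletion and after MICE multiple imputation was calculated and compared. The average misclassification rate after multiple imputation was calculated separately for categorical variables. Results: In the three missing-at-random scenarios with single-variable missing rates of 10%, 20% and 30%, MICE multiple imputation performed well; in the other simulated scenarios, MICE multiple imputation showed no clear advantage over listwise deletion. For categorical variables, the average misclassification rate after MICE imputation exceeded 60% in all cases. Conclusion: For data that are missing at random with a single-variable missing rate of no more than 30%, MICE multiple imputation is recommended; for categorical variables in the data, however, directly citing the specific values imputed by MICE is not recommended.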
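Abstract: [Objective] Current drought research mostly analyzes causes and trends based on historical drought events, whereas combining past and future long time-series data can better reveal the characteristics of drought change. This study seeks a method for simulating drought indices from future meteorological data output by CMIP5 models and explores the characteristics of past and future drought change in Shaanxi Province, to provide a basis for future agricultural water resource management in the province. [Methods] Using historical data from 18 meteorological stations in Shaanxi Province and future meteorological data output by CMIP5 models, three models for simulating reference crop evapotranspiration (ET0) were compared. The standardized precipitation evapotranspiration index (SPEI) and the relative moisture index (MI) were calculated from ET0 and precipitation data to reflect drought severity, and the spatiotemporal characteristics of drought in the past (1958-2018) and the future (2019-2100) were compared. [Results] The multiple linear regression (MLR) model simulated ET0 fairly accurately (RMSE = 0.457 mm·d^-1). Under the RCP2.6 and RCP8.5 scenarios, future drought indices showed an upward trend; under RCP8.5, there was an abrupt-change year in the drought indices in the 2040s. Future drought severity in Shaanxi Province decreased, and the within-year distribution of drought became more uneven. In the future period, drought severity decreased in the summer maize growing season and increased in the winter wheat growing season. [Conclusion] The characteristics of future drought change differ between RCP scenarios. Under the same RCP scenario, the changes in drought characteristics reflected by SPEI and MI are largely consistent, but differ in some periods. To cope effectively with the negative effects of climate change on the yield of dryland crops, the water storage and moisture conservation capacity of soil should be strengthened, and drought resistance work during the winter wheat growing season in particular should be reinforced.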
Funding: Supported by the Institute of Cancer Research and Molecular Medicine, The Medical Faculty, Norwegian University of Science and Technology, Trondheim, Norway; and the Department of Research, Levanger Hospital, Levanger.
Abstract: AIM: To assess risk factors for hospital admission for acute colonic diverticulitis. METHODS: The study was conducted as part of the second wave of the population-based North Trondelag Health Study (HUNT2), performed in North Trondelag County, Norway, from 1995 to 1997. The study consisted of 42570 participants (65.1% from HUNT2) who were followed up from 1998 to 2012. Of these, 22436 (52.7%) were females. The cases were defined as the 358 participants admitted with acute colonic diverticulitis during follow-up. The remaining participants were used as controls. Univariable and multivariable Cox regression analyses were used for each sex separately after multiple imputation to calculate hazard ratios (HR). RESULTS: Multivariable Cox regression analyses showed that increasing age increased the risk of admission for acute colonic diverticulitis. Compared with ages < 50 years, females aged 50-70 years had HR = 3.42, P < 0.001, and those aged > 70 years had HR = 6.19, P < 0.001. In males the corresponding values were HR = 1.85, P = 0.004 and HR = 2.56, P < 0.001. In patients with obesity (body mass index ≥ 30), HR = 2.06, P < 0.001 in females and HR = 2.58, P < 0.001 in males. In females, present (HR = 2.11, P < 0.001) or previous (HR = 1.65, P = 0.007) cigarette smoking increased the risk of admission. In males, breathlessness (HR = 2.57, P < 0.001) and living in rural areas (HR = 1.74, P = 0.007) increased the risk. Level of education, physical activity, constipation and type of bread eaten showed no association with admission for acute colonic diverticulitis. CONCLUSION: The risk of hospital admission for acute colonic diverticulitis increased with increasing age, in obese individuals, in females who had ever smoked cigarettes, and in males living in rural areas.
Abstract Missing data occur frequently in longitudinal data analysis, and many methods have been proposed in the literature to handle them. Complete case (CC) analysis, mean substitution (MS), last observation carried forward (LOCF), and multiple imputation (MI) are the four methods most frequently used in practice. In real-world data analysis, missing data can be missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR), depending on the reasons that lead to the missingness. In this paper, simulations under various conditions (missing mechanisms, missing rates, and slope sizes) were conducted to evaluate the performance of the four methods, using bias, RMSE, and 95% coverage probability as evaluation criteria. The results showed that LOCF has the largest bias and the poorest 95% coverage probability in most cases under both the MAR and MCAR missing mechanisms. Hence, LOCF should not be used in longitudinal data analysis. Under the MCAR mechanism, CC and MI perform equally well. Under the MAR mechanism, MI has the smallest bias, the smallest RMSE, and the best 95% coverage probability. Therefore, either CC or MI is appropriate under MCAR, while MI is the more reliable and better-grounded statistical method under MAR.
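As a rough illustration of how three of the four methods named in this abstract treat the same incomplete records, the sketch below applies CC, MS, and LOCF to a tiny made-up longitudinal dataset. The dataset `toy`, the subject labels, and all function names are hypothetical and not taken from the paper. MI is omitted because it requires fitting an imputation model and drawing several completed datasets.

```python
from statistics import mean

# Hypothetical toy data: subject -> outcome at visits 1..3 (None = missing).
toy = {
    "s1": [10.0, 12.0, 14.0],
    "s2": [9.0, None, None],
    "s3": [11.0, 13.0, None],
}

def complete_case(data):
    """CC: keep only subjects with no missing visit."""
    return {k: v for k, v in data.items() if all(x is not None for x in v)}

def locf(series):
    """LOCF: replace each missing visit with the last observed value."""
    filled, last = [], None
    for x in series:
        if x is not None:
            last = x
        filled.append(last)  # stays None if nothing has been observed yet
    return filled

def mean_substitution(data):
    """MS: replace each missing visit with the mean of the observed
    values at that visit across subjects."""
    n_visits = len(next(iter(data.values())))
    visit_means = [
        mean(v[j] for v in data.values() if v[j] is not None)
        for j in range(n_visits)
    ]
    return {
        k: [visit_means[j] if x is None else x for j, x in enumerate(v)]
        for k, v in data.items()
    }

if __name__ == "__main__":
    print("CC:  ", complete_case(toy))
    print("LOCF:", {k: locf(v) for k, v in toy.items()})
    print("MS:  ", mean_substitution(toy))
```

In this toy case LOCF flattens the trajectory of subject s2 while the fully observed subject keeps rising, which is consistent with the kind of slope bias the abstract attributes to LOCF.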
Abstract In his 1987 classic book on multiple imputation (MI), Rubin used the fraction of missing information, γ, to define the relative efficiency (RE) of MI as RE = (1 + γ/m)^(-1/2), where m is the number of imputations, leading to the conclusion that a small m (≤ 5) would be sufficient for MI. However, evidence has been accumulating that many more imputations are needed. Why would the apparently sufficient m deduced from the RE actually be too small? The answer may lie with γ. In this research, γ was determined at fractions of missing data (δ) of 4%, 10%, 20%, and 29% using the 2012 Physician Workflow Mail Survey of the National Ambulatory Medical Care Survey (NAMCS). The γ values were strikingly small, ranging from the order of 10^-6 to 0.01. As δ increased, γ usually increased but sometimes decreased. How the data were analysed had the dominant effect on γ, overshadowing the effect of δ. The results suggest that it is impossible to predict γ from δ and that it may not be appropriate to use the γ-based RE to determine a sufficient m.
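To make the RE argument concrete, the following minimal sketch evaluates the formula quoted in this abstract, RE = (1 + γ/m)^(-1/2), for a few values of γ and m. The function name and the grid of values are illustrative assumptions. The values 10^-6 and 0.01 mirror the range of γ reported in the abstract, while 0.3 is a hypothetical larger value added for contrast.

```python
def relative_efficiency(gamma: float, m: int) -> float:
    """Relative efficiency of MI with m imputations on the standard-error
    scale, as quoted in the abstract: (1 + gamma/m) ** (-1/2)."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma is a fraction and must lie in [0, 1]")
    return (1.0 + gamma / m) ** -0.5

if __name__ == "__main__":
    # Illustrative grid; only 1e-6 and 0.01 come from the abstract's range.
    for gamma in (1e-6, 0.01, 0.3):
        for m in (5, 20, 100):
            re = relative_efficiency(gamma, m)
            print(f"gamma={gamma:g}  m={m:3d}  RE={re:.6f}")
```

For γ as small as those reported, RE is already very close to 1 at m = 5, so the formula alone offers little guidance for choosing m. This matches the abstract's caution about using the γ-based RE to determine m.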
Funding: Supported by the National Natural Science Foundation of China (61071145; 61271331)
Abstract To address the deterioration in detection performance caused by a dynamically changing environment in ultra-wideband (UWB) multiple-input multiple-output (MIMO) radar, this paper proposes a novel adaptive waveform design aimed at improving the ability to discriminate targets from clutter in the radar scene. First, a sequence of Morlet wavelet pulses with frequency hopping and pulse position modulation based on a Welch-Costas array is designed. Then a waveform optimization solution is proposed, which is obtained by applying a mutual-information (MI) minimization strategy. Simulation results demonstrate that, over subsequent iterations of the algorithm, the optimal waveform design method improves target detection ability in the presence of noise and clutter.
Funding: Supported by the National Natural Science Foundation of China (81301774; 81470362)
Abstract Multiple myeloma (MM) is a common malignant hematological disease. Dysregulation of microRNAs (miRNAs) in MM cells and the bone marrow microenvironment has important impacts on the initiation and progression of MM and on drug resistance in MM cells. Recently, it was reported that serum and plasma from MM patients contain sufficiently stable miRNA signatures, and that circulating miRNAs can be identified and measured accurately in body fluids. Compared with conventional diagnostic parameters, the circulating miRNA profile is appropriate for the diagnosis of MM and estimates patient progression and therapeutic outcome with higher specificity and sensitivity. In this review, we mainly focus on the potential of circulating miRNAs as diagnostic, prognostic, and predictive biomarkers for MM and summarize the general strategies and methodologies for identifying and measuring circulating miRNAs in various cancers. Furthermore, we discuss the correlation between circulating miRNAs and the cytogenetic abnormalities and biochemical parameters assessed in multiple myeloma.