Abstract: The development of defect prediction plays a significant role in improving software quality. Such predictions are used to identify defective modules before testing and to minimize time and cost. Software with defects increases operational costs and ultimately affects customer satisfaction. Numerous approaches exist to predict software defects, but timely and accurate prediction of software bugs remains a major challenge. To improve timely and accurate software defect prediction, a novel technique called Nonparametric Statistical feature scaled QuAdratic regressive convolution Deep nEural Network (SQADEN) is introduced. The proposed SQADEN technique includes two major processes, namely metric (feature) selection and classification. First, SQADEN uses the nonparametric statistical Torgerson–Gower scaling technique to identify the relevant software metrics, measuring similarity with the Dice coefficient. This feature selection step is used to minimize the time complexity of software fault prediction. With the selected metrics, software fault prediction is then performed with Quadratic Censored regressive convolution deep neural network-based classification. The deep learning classifier analyzes the training and testing samples using the contingency correlation coefficient, and the softstep activation function provides the final fault prediction results. To minimize the error, the Nelder–Mead method is applied to solve the resulting non-linear least-squares problem. Finally, accurate classification results with minimum error are obtained at the output layer. Experimental evaluation is carried out with different quantitative metrics such as accuracy, precision, recall, F-measure, and time complexity. The results demonstrate the superior performance of the proposed SQADEN technique, with accuracy, sensitivity and specificity improved by 3%, 3%, 2% and 3%, and time and space reduced by 13% and 15%, compared with two state-of-the-art methods.
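The Nelder–Mead error-minimization step mentioned above can be illustrated with a short, self-contained sketch. The exponential model, synthetic data, and parameter names below are assumptions for illustration only, not the SQADEN objective.

```python
# Minimal sketch: Nelder-Mead applied to a nonlinear least-squares objective
# (illustrative model and data, not the SQADEN implementation).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 50)
y = 2.0 * np.exp(-1.3 * x) + 0.05 * rng.standard_normal(50)   # synthetic observations

def sse(params):
    """Sum of squared residuals for a simple exponential-decay model."""
    a, b = params
    residuals = y - a * np.exp(-b * x)
    return np.sum(residuals ** 2)

# Nelder-Mead is derivative-free, so it only needs objective values.
result = minimize(sse, x0=[1.0, 1.0], method="Nelder-Mead")
print(result.x, result.fun)
```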
Funding: National Natural Science Foundation of China (No. 12271261); Postgraduate Research and Practice Innovation Program of Jiangsu Province, China (Grant No. SJCX230368).
Abstract: Normality testing is a fundamental hypothesis test in the statistical analysis of key biological indicators of diabetes. If this assumption is violated, the test results may deviate from the true values, leading to incorrect inferences and conclusions and ultimately affecting the validity and accuracy of statistical inference. Considering this, the study designs a unified analysis scheme for different data types based on parametric and non-parametric statistical test methods. The data were grouped by sample type and divided into discrete and continuous data. To account for differences among subgroups, the conventional chi-squared test was used for discrete data. The normal distribution is the basis of many statistical methods; if the data do not follow a normal distribution, many of these methods fail or produce incorrect results. Therefore, before analysis and modeling, the data were divided into normal and non-normal groups through normality testing. For normally distributed data, parametric statistical methods were used to judge the differences between groups; for non-normal data, non-parametric tests were employed to improve the accuracy of the analysis. Statistically significant indicators were retained according to the P-value of the statistical test or the corresponding statistic. These indicators were then combined with the relevant medical background to further explore the etiology leading to the occurrence or transformation of diabetes status.
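A minimal sketch of the decision scheme described above, assuming the chi-squared test for discrete indicators, the Shapiro-Wilk test for normality, and a t-test versus Mann-Whitney U choice for continuous indicators (the specific tests are assumptions, not stated in the abstract):

```python
# Sketch of a normality-driven analysis scheme for one indicator and two groups.
import numpy as np
from scipy import stats

def compare_indicator(group_a, group_b, discrete=False, alpha=0.05):
    """Return (test_name, p_value) for one biological indicator."""
    group_a, group_b = np.asarray(group_a), np.asarray(group_b)
    if discrete:
        # Contingency table of category counts per group, then chi-squared test.
        cats = np.union1d(group_a, group_b)
        table = [[np.sum(g == c) for c in cats] for g in (group_a, group_b)]
        _, p, _, _ = stats.chi2_contingency(table)
        return "chi-squared", p
    # Continuous data: test normality in both groups first.
    normal = (stats.shapiro(group_a).pvalue > alpha
              and stats.shapiro(group_b).pvalue > alpha)
    if normal:
        return "t-test", stats.ttest_ind(group_a, group_b).pvalue
    return "Mann-Whitney U", stats.mannwhitneyu(group_a, group_b).pvalue
```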
Abstract: It is well known that nonparametric estimation of the regression function is highly sensitive to the presence of even a small proportion of outliers in the data. To handle atypical observations when the covariates of the nonparametric component are functional, robust estimates for the regression parameter and regression operator are introduced. The main purpose of the paper is to consider data-driven methods of selecting the number of neighbors in order to make the proposed procedures fully automatic. We use the k-Nearest Neighbors (kNN) procedure to construct the kernel estimator of the proposed robust model. Under some regularity conditions, we state consistency results for kNN functional estimators that are uniform in the number of neighbors (UINN). Furthermore, a simulation study and an empirical application to a real data analysis of octane gasoline predictions are carried out to illustrate the higher predictive performance and the usefulness of the kNN approach.
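The data-driven choice of the number of neighbors can be sketched in a toy scalar-covariate setting; the paper's robust, functional-covariate estimator is more involved (a semi-metric on curves and a robust location estimate instead of the mean), so the following is only an illustrative analogue:

```python
# Toy sketch: kNN regression with a data-driven choice of k via cross-validation.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(200)

# Select the number of neighbors automatically by cross-validated error.
search = GridSearchCV(KNeighborsRegressor(),
                      {"n_neighbors": range(1, 31)},
                      cv=5, scoring="neg_mean_absolute_error")
search.fit(X, y)
print("selected k:", search.best_params_["n_neighbors"])
```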
Abstract: The Cochran-Mantel-Haenszel (CMH) test, developed in the 1950s, is a classic in health research, especially in epidemiology and other fields in which dichotomous and polytomous variables are frequent. This nonparametric test makes it possible to measure and check the effect of an antecedent variable X on a health outcome Y while statistically controlling for a third variable Z that acts as a confounder in the relationship between X and Y. Both X and Y are measured on a dichotomous qualitative scale and Z on a polytomous-qualitative or ordinal scale. It is assumed that the effect of X on Y is homogeneous across the k strata of Z, which is usually tested by the Breslow-Day test with Tarone's correction or by Woolf's test. The main statistical programs provide the CMH test together with a test of the assumption of a homogeneous effect across strata, so it is easy to apply. However, its fundamentals and computational details are a mystery to most researchers, and are even difficult to find or understand. The aim of this article is to present these details in a clear and concise way, including the assumptions and the alternatives when they are not met. This technical knowledge is applied to a simulated but realistic example from health epidemiology and, finally, an interpretive synthesis of the analyses is given. In addition, some suggestions for reporting the test are made.
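A hedged sketch of how the CMH test and the homogeneity check can be run in practice, here with statsmodels and made-up 2x2 tables per stratum:

```python
# Sketch of the CMH test with a homogeneity check (illustrative tables only).
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# One 2x2 table of exposure (X) by outcome (Y) for each stratum of Z.
tables = [
    np.array([[20, 30], [10, 40]]),
    np.array([[15, 25], [12, 48]]),
    np.array([[30, 20], [18, 32]]),
]
st = StratifiedTable(tables)
print(st.test_null_odds(correction=True).pvalue)   # CMH test of a common odds ratio = 1
print(st.test_equal_odds(adjust=True).pvalue)      # Breslow-Day test with Tarone adjustment
print(st.oddsratio_pooled)                         # Mantel-Haenszel pooled odds ratio
```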
Funding: Chongqing Technology Innovation and Application Development Project (Grant/Award Number: CSTB2022TIAD-KPX0067); Sichuan Science and Technology Program (Grant/Award Number: 2022YFS0048); National Natural Science Foundation of China (Grant/Award Number: 62372316); National Science and Technology Major Project (Grant/Award Numbers: 2018ZX10201002, 2021YFF1201200).
Abstract: Cardiovascular disease (CVD) is the major cause of death in many regions around the world, and several of its risk factors might be linked to diet. To improve public health and the understanding of this topic, we look at the recent Minnesota Coronary Experiment (MCE) analysis that used the t-test and the Cox model to evaluate CVD risks. However, these parametric methods might suffer from three problems: small sample size, right-censoring bias, and lack of long-term evidence. To overcome the first of these challenges, we utilize a nonparametric permutation test to examine the relationship between dietary fats and serum total cholesterol. To address the second problem, we use a resampling-based rank test to examine whether the serum total cholesterol level affects CVD deaths. For the third issue, we use additional Framingham Heart Study (FHS) data with an A/B test to look for meta-relationships between diets, risk factors, and CVD risks. We show that, firstly, the link between low-saturated-fat diets and reduction in serum total cholesterol is strong. Secondly, reducing serum total cholesterol does not robustly have an impact on CVD hazards in the diet group. Lastly, the A/B test result suggests a more complicated relationship regarding abnormal diastolic blood pressure ranges caused by diets and how these might affect the associative link between cholesterol level and heart disease risk. This study not only helps us to analyze the MCE data in depth but also, in combination with the long-term FHS data, reveals possible complex relationships behind diets, risk factors, and heart disease.
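The permutation test used for the first problem can be sketched as follows, with synthetic cholesterol-change numbers rather than the MCE data:

```python
# Minimal permutation-test sketch for a difference in mean cholesterol change
# between a diet group and a control group (synthetic numbers, not MCE data).
import numpy as np

rng = np.random.default_rng(42)
diet    = rng.normal(-0.6, 1.0, 40)   # hypothetical cholesterol changes
control = rng.normal(-0.1, 1.0, 40)

observed = diet.mean() - control.mean()
pooled = np.concatenate([diet, control])
n_perm, count = 10_000, 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)                 # reshuffle group labels
    diff = perm[:len(diet)].mean() - perm[len(diet):].mean()
    if abs(diff) >= abs(observed):
        count += 1
print("two-sided permutation p-value:", (count + 1) / (n_perm + 1))
```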
Funding: Project (50975263) supported by the National Natural Science Foundation of China; Project (2010081015) supported by the International Cooperation Project of Shanxi Province, China; Project (2010-78) supported by the Scholarship Council of Shanxi Province, China; Project (2010420120005) supported by the Doctoral Fund of the Ministry of Education of China.
Abstract: A new algorithm based on the projection method with an implicit finite difference technique was established to calculate the velocity fields and pressure. The calculation region can be divided into different regions according to the Reynolds number. In the far-wall region, the thermal melt flow was calculated as a Newtonian flow; in the near-wall region, it was calculated as a non-Newtonian flow. The correctness of the new algorithm was verified through a nonparametric statistical method and experiments. The simulation results show that the new algorithm based on the projection method with the implicit technique computes faster than the solution algorithm-volume of fluid method using the explicit difference method.
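For readers unfamiliar with the projection (fractional-step) idea, the following toy sketch shows one explicit update on a periodic grid with a spectral Poisson solve; it is an assumption-laden simplification, not the authors' implicit finite-difference, non-Newtonian scheme:

```python
# One fractional-step (projection) update: provisional velocity, pressure
# Poisson solve, then projection to a divergence-free field (toy periodic case).
import numpy as np

n, L, dt, nu = 64, 1.0, 1e-3, 1e-2
dx = L / n
rng = np.random.default_rng(0)
u = rng.standard_normal((n, n))
v = rng.standard_normal((n, n))

k = 2 * np.pi * np.fft.fftfreq(n, d=dx)
kx, ky = np.meshgrid(k, k, indexing="ij")

def dfx(f):   # spectral d/dx
    return np.real(np.fft.ifft2(1j * kx * np.fft.fft2(f)))

def dfy(f):   # spectral d/dy
    return np.real(np.fft.ifft2(1j * ky * np.fft.fft2(f)))

def lap(f):   # spectral Laplacian
    return np.real(np.fft.ifft2(-(kx**2 + ky**2) * np.fft.fft2(f)))

# 1) provisional velocity (diffusion only; advection omitted for brevity)
u_star = u + dt * nu * lap(u)
v_star = v + dt * nu * lap(v)

# 2) pressure Poisson equation  lap(p) = div(u*) / dt
rhs_hat = np.fft.fft2((dfx(u_star) + dfy(v_star)) / dt)
denom = -(kx**2 + ky**2)
denom[0, 0] = 1.0                      # pin the arbitrary constant pressure mode
p = np.real(np.fft.ifft2(rhs_hat / denom))

# 3) projection step: subtract the pressure gradient
u_new = u_star - dt * dfx(p)
v_new = v_star - dt * dfy(p)
print("max |div| after projection:", np.abs(dfx(u_new) + dfy(v_new)).max())
```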
Abstract: The objective of this article is to demonstrate with examples that the two-sided tie correction does not work well. This correction was developed by Cureton so that Kendall's tau-type and Spearman's rho-type formulas for rank-biserial correlation yield the same result when ties are present. However, a correction based on the bracket ties achieves the desired goal, which is demonstrated algebraically and checked with three examples. On the one hand, the 10-element random sample given by Cureton, in which the two-sided tie correction performs well, is taken up. On the other hand, two other examples are given, one with a 7-element random sample and the other with a clinical random sample of 31 participants, in which the two-sided tie correction does not work but the new correction does. It is concluded that the new corrected formulas coincide with Goodman-Kruskal's gamma, whereas Glass' formula matches Somers' d_Y|X, the asymmetric measure of association of the Y ranking with respect to the X dichotomy. The use of this underreported coefficient is suggested, since it is very easy to calculate from its equivalence with Goodman-Kruskal's gamma and Somers' d_Y|X.
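The concordance-based quantities involved, Goodman-Kruskal's gamma and Somers' d_Y|X for a dichotomous X and a ranked Y, can be computed directly from concordant, discordant, and tied pairs; the toy data below are not the article's samples:

```python
# Gamma and Somers' d_Y|X from pair counts (toy data for illustration).
from itertools import combinations

X = [0, 0, 0, 1, 1, 1, 1]          # dichotomous grouping
Y = [2, 5, 5, 5, 7, 7, 9]          # ranked outcome (ties allowed)

C = D = Ty = 0                      # concordant, discordant, tied on Y only
for (x1, y1), (x2, y2) in combinations(zip(X, Y), 2):
    if x1 == x2:                    # pairs tied on X are skipped
        continue
    if y1 == y2:
        Ty += 1
    elif (x1 - x2) * (y1 - y2) > 0:
        C += 1
    else:
        D += 1

gamma = (C - D) / (C + D)           # Goodman-Kruskal's gamma
d_yx = (C - D) / (C + D + Ty)       # Somers' d_Y|X (ties on Y enter the denominator)
print(gamma, d_yx)
```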
Abstract: Light Detection and Ranging (LiDAR) technology generates dense and precise three-dimensional datasets in the form of point clouds. Conventional methods of mapping with airborne LiDAR datasets rely on classification or feature-specific segmentation. These processes have been observed to be time-consuming and unfit for scenarios where topographic information is required in a short time. Thus, there is a need for methods that process the data and reconstruct the scene quickly. This paper presents several pipelines for visualizing LiDAR datasets without going through classification and compares them using statistical methods to rank these pipelines in order of depth and feature perception. To make the comparison more meaningful, a manually classified and computer-aided design (CAD) reconstructed dataset is also included in the list of compared methods. Results show that a heuristic-based method previously developed by the authors performs almost equivalently, for the purposes of visualization, to the manually classified and reconstructed dataset. This paper makes two distinct contributions: (1) a heuristics-based visualization pipeline for LiDAR datasets, and (2) an experimental design supported by statistical analysis to compare different pipelines.
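One way such a ranking can be obtained is sketched here with invented perception scores, a Friedman test, and mean ranks; the paper's actual experimental design and tests may differ:

```python
# Toy sketch: nonparametric comparison and ranking of visualization pipelines.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# rows = participants, columns = pipelines (e.g. heuristic, raw cloud, CAD)
scores = np.array([
    [4, 2, 5],
    [5, 3, 5],
    [4, 2, 4],
    [3, 1, 5],
    [4, 2, 4],
    [5, 3, 5],
    [4, 1, 4],
    [5, 2, 5],
])
stat, p = friedmanchisquare(*scores.T)               # overall difference between pipelines
mean_ranks = rankdata(scores, axis=1).mean(axis=0)   # higher mean rank = better perceived
print("Friedman chi2 = %.2f, p = %.4f" % (stat, p))
print("mean rank per pipeline:", mean_ranks)
```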
Abstract: A method for processing non-stationary random processes using nonparametric methods of decision theory is considered. Such methods are applicable in telemetry systems that must process fast-changing random processes in real time under a priori uncertainty about the probabilistic properties of the measured process.
Abstract: This paper analyzes Bernoulli binary sequences in the representation of empirical nonlinear events, examining the distribution of natural resources, population sizes, and other variables that influence the possible outcomes of resource usage. The event is considered as a nonlinear system, and the analysis metrics consist of two dependent random variables, 0 and 1, with memory and with probabilities, over maximal finite or infinite lengths, constant and equal to 1/2 for both variables (a stationary process). The expressions of the possible trajectories of the metric space represented by each binary parameter remain constant in sequences that repeat, alternating the presence or absence of one of the binary variables at each iteration (symmetric or asymmetric). It was observed that the binary variables X_1 and X_2 assume, as T_k → ∞, specific behaviors (geometric variable) that can be used as management tools in discrete and continuous nonlinear systems, aiming at the optimization of resource usage, nonlinearity analysis, and the probabilistic distribution of trajectories occurring in random events. In this way, the paper presents a model for detecting fixed-point attractors and their probabilistic distributions for a given population-resource dynamic. This means that coupling oscillations in the event occur when the binary variables X_1 and X_2 are limited as a function of time Y.
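A toy simulation of the stationary Bernoulli setting described above, tracking the cumulative proportion of ones and the run-length structure of a fair binary sequence; this is illustrative only and does not reproduce the paper's population-resource model:

```python
# Fair binary sequence: trajectory of the running proportion of ones and run lengths.
import numpy as np

rng = np.random.default_rng(7)
T = 10_000
x = rng.integers(0, 2, size=T)                       # Bernoulli(1/2) draws
proportion_of_ones = np.cumsum(x) / np.arange(1, T + 1)

# Longest run of identical symbols, a simple summary of the alternation structure.
changes = np.flatnonzero(np.diff(x) != 0)
run_lengths = np.diff(np.concatenate(([0], changes + 1, [T])))
print("final proportion of ones:", proportion_of_ones[-1])
print("longest run:", run_lengths.max())
```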
Funding: Project (No. 60773180) supported by the National Natural Science Foundation of China.
Abstract: This paper deals with the statistical modeling of latent topic hierarchies in text corpora. The height of the topic tree is assumed to be fixed, while the number of topics on each level is unknown a priori and is to be inferred from data. Taking a nonparametric Bayesian approach to this problem, we propose a new probabilistic generative model based on the nested hierarchical Dirichlet process (nHDP) and present a Markov chain Monte Carlo sampling algorithm for the inference of the topic tree structure as well as the word distribution of each topic and the topic distribution of each document. Our theoretical analysis and experimental results show that this model produces a more compact hierarchical topic structure and captures more fine-grained topic relationships than the hierarchical latent Dirichlet allocation model.
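The nonparametric Bayesian ingredient that lets the number of topics per level stay open-ended can be illustrated with a toy Chinese-restaurant-process prior simulation; this shows only the basic building block, not the paper's nHDP MCMC inference:

```python
# Toy Chinese-restaurant-process sampler: the number of clusters (topics) is not
# fixed in advance but grows with the data, governed by the concentration alpha.
import numpy as np

def crp_assignments(n_customers, alpha, rng):
    """Sample cluster labels for n_customers under a CRP with concentration alpha."""
    counts = []                                   # customers per existing table
    labels = []
    for i in range(n_customers):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= i + alpha                        # P(existing) proportional to count, P(new) to alpha
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)                      # open a new table (topic)
        else:
            counts[table] += 1
        labels.append(table)
    return labels, counts

rng = np.random.default_rng(3)
labels, counts = crp_assignments(200, alpha=2.0, rng=rng)
print("number of clusters:", len(counts), "sizes:", counts)
```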