A research study collected intensive longitudinal data from cancer patients on a daily basis as well as non-intensive longitudinal survey data on a monthly basis. Although the daily data need separate analysis, those data can also be utilized to generate predictors of monthly outcomes. Alternatives for generating daily data predictors of monthly outcomes are addressed in this work. Analyses are reported of depression measured by the Patient Health Questionnaire 8 as the monthly survey outcome. Daily measures include numbers of opioid medications taken, numbers of pain flares, least pain levels, and worst pain levels. Predictors are averages of recent non-missing values for each daily measure recorded on or prior to survey dates for depression values. Weights for recent non-missing values are based on days between measurement of a recent value and a survey date. Five alternative averages are considered: averages with unit weights, averages with reciprocal weights, weighted averages with reciprocal weights, averages with exponential weights, and weighted averages with exponential weights. Adaptive regression methods based on likelihood cross-validation (LCV) scores are used to generate fractional polynomial models for possible nonlinear dependence of depression on each average. For all four daily measures, the best LCV score over averages of all types is generated using the average of recent non-missing values with reciprocal weights. Generated models are nonlinear and monotonic. Results indicate that an appropriate choice would be to assume three recent non-missing values and use the average with reciprocal weights of the first three recent non-missing values.
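As a minimal sketch of how such a predictor might be computed (assuming, purely for illustration, that the reciprocal weight for a value recorded d days before the survey is 1/(1 + d) and that the three most recent non-missing values are used), the following Python fragment shows the idea; the function name, weight form, and example data are hypothetical, not the authors' exact specification.

```python
import numpy as np

def reciprocal_weight_average(days_before_survey, values, n_recent=3):
    """Average the n_recent most recent non-missing daily values,
    weighting each by the reciprocal of its distance (in days)
    from the survey date. The weight definition is an assumption."""
    days = np.asarray(days_before_survey, dtype=float)
    vals = np.asarray(values, dtype=float)
    keep = ~np.isnan(vals)                      # drop missing daily values
    days, vals = days[keep], vals[keep]
    order = np.argsort(days)[:n_recent]         # closest to the survey date first
    w = 1.0 / (1.0 + days[order])               # reciprocal weights (assumed form)
    return np.sum(w * vals[order]) / np.sum(w)

# Example: worst-pain levels recorded 1, 2, 4 and 9 days before a survey
print(reciprocal_weight_average([1, 2, 4, 9], [6.0, np.nan, 5.0, 7.0]))
```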
Background: Bivariate count data are commonly encountered in medicine, biology, engineering, epidemiology and many other applications. The Poisson distribution has been the model of choice to analyze such data. In most cases mutual independence among the variables is assumed, but this fails to take into account the correlation between the outcomes of interest. A special bivariate form of the multivariate Lagrange family of distributions, named the Generalized Bivariate Poisson Distribution, is considered in this paper. Objectives: We estimate the model parameters using the method of maximum likelihood and show that the model fits the count variables representing components of metabolic syndrome in spousal pairs. We use the likelihood local score to test the significance of the correlation between the counts. We also construct confidence intervals on the ratio of the two correlated Poisson means. Methods: Based on a random sample of pairs of count data, we show that the score test of independence is locally most powerful. We also provide a formula for sample size estimation for a given level of significance and power. The confidence intervals on the ratio of correlated Poisson means are constructed using the delta method, Fieller's theorem, and the nonparametric bootstrap. We illustrate the methodologies on metabolic syndrome data collected from 4000 spousal pairs. Results: The bivariate Poisson model fitted the metabolic syndrome data quite satisfactorily. Moreover, the three methods of confidence interval estimation were almost identical, yielding essentially the same interval width.
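A minimal sketch of the nonparametric bootstrap interval for the ratio of two correlated Poisson means, one of the three interval methods mentioned, might look as follows; the paired counts here are simulated purely for illustration and are not the metabolic syndrome data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative paired counts (e.g., metabolic-syndrome component counts per spouse)
x = rng.poisson(2.0, size=500)   # husbands
y = rng.poisson(1.5, size=500)   # wives

def bootstrap_ratio_ci(x, y, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for mean(x)/mean(y), resampling pairs jointly."""
    n = len(x)
    ratios = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)              # resample spousal pairs together
        ratios[b] = x[idx].mean() / y[idx].mean()
    return np.quantile(ratios, [alpha / 2, 1 - alpha / 2])

print(bootstrap_ratio_ci(x, y))
```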
Adaptive fractional polynomial modeling of general correlated outcomes is formulated to address nonlinearity in means, variances/dispersions, and correlations. Means and variances/dispersions are modeled using generalized linear models in fixed effects/coefficients. Correlations are modeled using random effects/coefficients. Nonlinearity is addressed using power transforms of primary (untransformed) predictors. Parameter estimation is based on extended linear mixed modeling generalizing both generalized estimating equations and linear mixed modeling. Models are evaluated using likelihood cross-validation (LCV) scores and are generated adaptively using a heuristic search controlled by LCV scores. Cases covered include linear, Poisson, logistic, exponential, and discrete regression of correlated continuous, count/rate, dichotomous, positive continuous, and discrete numeric outcomes treated as normally, Poisson, Bernoulli, exponentially, and discrete numerically distributed, respectively. Example analyses are also generated for these five cases to compare adaptive random effects/coefficients modeling of correlated outcomes to previously developed adaptive modeling based on directly specified covariance structures. Adaptive random effects/coefficients modeling substantially outperforms direct covariance modeling in the linear, exponential, and discrete regression example analyses. It generates equivalent results in the logistic regression example analyses and it is substantially outperformed in the Poisson regression case. Random effects/coefficients modeling of correlated outcomes can provide substantial improvements in model selection compared to directly specified covariance modeling. However, directly specified covariance modeling can generate competitive or substantially better results in some cases while usually requiring less computation time.
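As a rough, much-simplified stand-in for the adaptive search described above, the fragment below selects a single fractional-polynomial power for one predictor by a cross-validated Gaussian log-likelihood score; the candidate power set and scoring details are assumptions, and the full method's handling of variances, correlations, and multiple transforms is not reproduced.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 5.0, 300)
y = 2.0 * np.sqrt(x) + rng.normal(0, 0.3, 300)   # truth uses power 0.5

POWERS = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]          # conventional FP1 candidate powers

def transform(x, p):
    return np.log(x) if p == 0 else x ** p        # power 0 is taken as log(x)

def cv_loglik(x, y, p, k=5):
    """Mean held-out Gaussian log-likelihood for a one-term fractional polynomial."""
    ll = 0.0
    for tr, te in KFold(k, shuffle=True, random_state=0).split(x):
        X_tr = transform(x[tr], p).reshape(-1, 1)
        X_te = transform(x[te], p).reshape(-1, 1)
        fit = LinearRegression().fit(X_tr, y[tr])
        resid = y[te] - fit.predict(X_te)
        sigma2 = np.mean((y[tr] - fit.predict(X_tr)) ** 2)
        ll += np.sum(-0.5 * (np.log(2 * np.pi * sigma2) + resid ** 2 / sigma2))
    return ll / len(x)

best = max(POWERS, key=lambda p: cv_loglik(x, y, p))
print("selected power:", best)
```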
Chemical oxygen demand (COD) is an important index to measure the degree of water pollution. In this paper, near-infrared technology is used to obtain 148 wastewater spectra to predict the COD value in wastewater. First, the partial least squares regression (PLS) model was used as the basic model. Monte Carlo cross-validation (MCCV) was used to remove 25 of the 148 samples that did not conform to conventional statistics. Then, interval partial least squares (iPLS) regression modeling was carried out on the remaining 123 samples, and the spectral bands were divided into 40 subintervals. The optimal subintervals are 20 and 26, and the optimal correlation coefficient of the test set (RT) is 0.58. Further, the waveband is divided into five intervals: 17, 19, 20, 22 and 26. When the number of joint intervals under each interval is three, the optimal RT is 0.71. When the number of joint subintervals is four, the optimal RT is 0.79. Finally, a convolutional neural network (CNN) was used for quantitative prediction, and RT was 0.9. The results show that CNN can automatically screen the features inside the data, and its quantitative prediction is better than that of iPLS and the synergy interval partial least squares (SiPLS) models with three and four joint subintervals, indicating that CNN can be used for quantitative analysis of the degree of water pollution.
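A minimal sketch of the base PLS step with scikit-learn, on hypothetical spectra rather than the paper's data, is shown below; the number of components and the synthetic target are placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(123, 700))          # hypothetical NIR spectra (samples x wavelengths)
y = X[:, 100:110].sum(axis=1) + rng.normal(0, 0.5, 123)   # synthetic COD-like target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
pls = PLSRegression(n_components=5).fit(X_tr, y_tr)

# Correlation coefficient of the test set (RT), the figure of merit used above
rt = np.corrcoef(y_te, pls.predict(X_te).ravel())[0, 1]
print(f"RT = {rt:.2f}")
```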
This study delves into the multifaceted impact of price hikes on the standard of living in Bangladesh, with a specific focus on distinct socioeconomic segments. Amidst Bangladesh’s economic growth, the challenges of rising inflation and increased living costs have become pressing concerns. The study employs a mixed-methods approach that combines quantitative data from a structured survey with qualitative insights from in-depth interviews and focus group discussions to analyze the repercussions of price hikes. Stratified random sampling ensures representation across affluent, middle-class, and economically disadvantaged groups. Utilizing data [1] from 2020 to November 2023 on the yearly change in retail prices of essential commodities, the analysis reveals significant demographic shifts, occupational changes, and altered asset ownership patterns among households. The vulnerable population, including daily wage laborers and low-income individuals, is disproportionately affected, with adjustments in consumption, income generation, and living arrangements. Statistical analyses, including One-Way ANOVA and Paired Sample t-tests, illuminate significant mean differences in strategies employed during price hikes. Despite these challenges, the prioritization of education remains evident, emphasizing its resilience in the face of economic hardships. The results show that price hikes, especially in essential items, lead to substantial adjustments in living costs, with items like onions, garlic, and ginger experiencing significant increases of 275%, 108%, and 483%, respectively.
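For the statistical comparisons named above, a minimal SciPy sketch (on made-up expenditure figures, purely to show the calls) could be:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical monthly food spending (in taka) by socioeconomic group
affluent     = rng.normal(25000, 3000, 60)
middle_class = rng.normal(15000, 2500, 60)
low_income   = rng.normal(8000,  1500, 60)

# One-Way ANOVA across the three strata
f_stat, p_anova = stats.f_oneway(affluent, middle_class, low_income)

# Paired-sample t-test: same households before vs. during the price hike
before = rng.normal(10000, 2000, 80)
during = before * rng.normal(1.3, 0.1, 80)
t_stat, p_paired = stats.ttest_rel(before, during)

print(f"ANOVA p = {p_anova:.3g}, paired t-test p = {p_paired:.3g}")
```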
Spatial variation is often encountered when large scale field trials are conducted, which can result in biased estimation or prediction of treatment (i.e. genotype) values. An effective removal of spatial variation is needed to ensure unbiased estimation or prediction and thus increase the accuracy of field data evaluation. A moving grid adjustment (MGA) method, which was proposed by Technow, was evaluated through Monte Carlo simulation for its statistical properties regarding field spatial variation control. Our simulation results showed that the MGA method can effectively account for field spatial variation if it does exist; however, this method will not change phenotype results if field spatial variation does not exist. The MGA method was applied to a large-scale cotton field trial data set with two representative agronomic traits: lint yield (strong field spatial pattern) and lint percentage (no field spatial pattern). The results suggested that the MGA method was able to effectively separate the spatial variation including blocking effects from random error variation for lint yield while the adjusted data remained almost identical to the original phenotypic data. With application of the MGA method, the estimated variance for residuals was significantly reduced (62.2% decrease) for lint yield while more genetic variation (29.7% increase) was detected compared to the original data analysis subject to the conventional randomized complete block design analysis. On the other hand, the results were almost identical for lint percentage with and without the application of the MGA method. Therefore, the MGA method can be a useful addition to enhance data analysis when field spatial pattern exists.
Developing a predictive model for detecting cardiovascular diseases (CVDs) is crucial due to their high global fatality rate. With the advancements in artificial intelligence, the availability of large-scale data, and increased access to computational capability, it is feasible to create robust models that can detect CVDs with high precision. This study aims to provide a promising method for early diagnosis by employing various machine learning and deep learning techniques, including logistic regression, decision trees, random forest classifier, extreme gradient boosting (XGBoost), and a sequential model from Keras. Our evaluation identifies the random forest classifier as the most effective model, achieving an accuracy of 0.91, surpassing other machine learning and deep learning approaches. Close behind are XGBoost (accuracy: 0.90), decision tree (accuracy: 0.86), and logistic regression (accuracy: 0.70). Additionally, our deep learning sequential model demonstrates promising classification performance, with an accuracy of 0.80 and a loss of 0.425 on the validation set. These findings underscore the potential of machine learning and deep learning methodologies in advancing cardiovascular disease prediction and management strategies.
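A stripped-down version of such a model comparison on a generic synthetic tabular dataset might look like this; the data and hyperparameters are placeholders rather than the study's setup, and XGBoost and the Keras model are omitted to keep the sketch dependency-free.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree":       DecisionTreeClassifier(random_state=0),
    "random forest":       RandomForestClassifier(n_estimators=300, random_state=0),
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)   # accuracy on the held-out set
    print(f"{name}: accuracy = {acc:.2f}")
```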
Effects of performing an R-factor analysis of observed variables based on population models comprising R- and Q-factors were investigated. Although R-factor analysis of data based on a population model comprising R- and Q-factors is possible, this may lead to model error. Accordingly, loading estimates resulting from R-factor analysis of sample data drawn from a population based on a combination of R- and Q-factors will be biased. It was shown in a simulation study that a large amount of Q-factor variance induces an increase in the variation of R-factor loading estimates beyond the chance level. Tests of the multivariate kurtosis of observed variables are proposed as an indicator of possible Q-factor variance in observed variables as a prerequisite for R-factor analysis.
Although there are many measures of variability for qualitative variables, they are little used in social research, nor are they included in statistical software. The aim of this article is to present six measures of variation for qualitative variables that are simple to calculate, as well as to facilitate their use by means of the R software. The measures considered are, on the one hand, Freeman's variation ratio, Moral's universal variation ratio, Kvalseth's standard deviation from the mode, and Wilcox's variation ratio, which are most affected by proximity to a constant random variable, where measures of variability for qualitative variables reach their minimum value of 0. On the other hand, the Gibbs-Poston index of qualitative variation and Shannon's relative entropy are included, which are more affected by proximity to a uniform distribution, where measures of variability for qualitative variables reach their maximum value of 1. Point and interval estimation are addressed. The percentile and bias-corrected and accelerated percentile bootstrap methods are used to obtain confidence intervals. Two calculation situations are presented: with a single sample mode and with two or more modes. The standard deviation from the mode among the six considered measures, and the universal variation ratio among the three variation ratios, are particularly recommended for use.
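Two of the measures are easy to state directly: Freeman's variation ratio is one minus the modal proportion, and Shannon's relative entropy is the entropy of the category proportions divided by its maximum, log k. A minimal Python sketch (the remaining measures follow the same pattern) is:

```python
import numpy as np
from collections import Counter

def variation_ratio(labels):
    """Freeman's variation ratio: 1 minus the proportion in the modal category."""
    counts = Counter(labels)
    return 1.0 - max(counts.values()) / len(labels)

def relative_entropy(labels):
    """Shannon's relative entropy: H(p) / log(k), ranging over [0, 1]."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(p)))

data = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
print(variation_ratio(data), relative_entropy(data))
```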
The paper presents an innovative approach towards agricultural insurance underwriting and risk pricing through the development of an Extreme Learning Machine (ELM) Actuarial Intelligent Model. This model integrates diverse datasets, including climate change scenarios, crop types, farm sizes, and various risk factors, to automate underwriting decisions and estimate loss reserves in agricultural insurance. The study conducts extensive exploratory data analysis, model building, feature engineering, and validation to demonstrate the effectiveness of the proposed approach. Additionally, the paper discusses the application of robust tests, stress tests, and scenario tests to assess the model’s resilience and adaptability to changing market conditions. Overall, the research contributes to advancing actuarial science in agricultural insurance by leveraging advanced machine learning techniques for enhanced risk management and decision-making.
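For context, the core of an extreme learning machine is a single hidden layer whose weights are drawn at random and never trained; only the output weights are fitted by least squares. A minimal regression sketch on toy data (not the paper's model or data) is:

```python
import numpy as np

rng = np.random.default_rng(0)

class ELMRegressor:
    """Single-hidden-layer extreme learning machine: random hidden weights,
    least-squares output weights."""
    def __init__(self, n_hidden=200):
        self.n_hidden = n_hidden

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        self.W = rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        self.beta, *_ = np.linalg.lstsq(H, y, rcond=None)   # output weights
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

# Toy example: predict a premium-like target from a few risk factors
X = rng.normal(size=(500, 6))
y = X[:, 0] * 2 + np.sin(X[:, 1]) + rng.normal(0, 0.1, 500)
model = ELMRegressor().fit(X[:400], y[:400])
print("test RMSE:", np.sqrt(np.mean((model.predict(X[400:]) - y[400:]) ** 2)))
```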
This study aims to establish a rationale for the Rice University rule in determining the number of bins in a histogram. It is grounded in the Scott and Freedman-Diaconis rules. Additionally, the accuracy of the empirical histogram in reproducing the shape of the distribution is assessed with respect to three factors: the rule for determining the number of bins (square root, Sturges, Doane, Scott, Freedman-Diaconis, and Rice University), sample size, and distribution type. Three measures are utilized: the average distance between empirical and theoretical histograms, the level of recognition by an expert judge, and the accuracy index, which is composed of the two aforementioned measures. Mean comparisons are conducted with aligned rank transformation analysis of variance for three fixed-effects factors: sample size (20, 35, 50, 100, 200, 500, and 1000), distribution type (10 types), and empirical rule to determine the number of bins (6 rules). From the accuracy index, Rice’s rule improves with increasing sample size and is independent of distribution type. It outperforms the Freedman-Diaconis rule but falls short of Scott’s rule, except with the arcsine distribution. Its profile of means resembles the square root rule concerning distributions and Doane’s rule concerning sample sizes. These profiles differ from those of the Scott and Freedman-Diaconis rules, which resemble each other. Among the rules considered, Scott’s rule stands out in terms of accuracy, except for the arcsine distribution, and the square root rule is the least accurate.
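For reference, most of the bin-count rules compared above can be written in a few lines: for a sample of size n, the Rice rule gives ⌈2n^(1/3)⌉ bins, Sturges gives ⌈log2 n⌉ + 1, the square-root rule gives ⌈√n⌉, and Scott and Freedman-Diaconis first set a bin width from the standard deviation or the interquartile range, respectively (Doane's skewness correction is omitted in this sketch).

```python
import numpy as np

def n_bins(x, rule="rice"):
    """Number of histogram bins under several classical rules."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if rule == "sqrt":
        return int(np.ceil(np.sqrt(n)))
    if rule == "sturges":
        return int(np.ceil(np.log2(n))) + 1
    if rule == "rice":
        return int(np.ceil(2 * n ** (1 / 3)))
    if rule == "scott":
        h = 3.49 * x.std(ddof=1) * n ** (-1 / 3)          # bin width from std. dev.
    elif rule == "fd":
        iqr = np.subtract(*np.percentile(x, [75, 25]))
        h = 2 * iqr * n ** (-1 / 3)                       # bin width from IQR
    else:
        raise ValueError(rule)
    return int(np.ceil((x.max() - x.min()) / h))

x = np.random.default_rng(0).normal(size=200)
print({r: n_bins(x, r) for r in ["sqrt", "sturges", "rice", "scott", "fd"]})
```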
Background: The signal-to-noise ratio (SNR) is recognized as an index of measurement reproducibility. We derive the maximum likelihood estimators of SNR and discuss confidence interval construction on the difference between two correlated SNRs when the readings are from bivariate normal and bivariate lognormal distributions. We use the Pearson system of curves to approximate the difference between the two estimates and use bootstrap methods to validate the approximate distributions of the statistic of interest. Methods: The paper uses the delta method to find the first four central moments, and hence the skewness and kurtosis, which are important in the determination of the parameters of the Pearson distribution. Results: The approach is illustrated in two examples: one from veterinary microbiology and food safety data and the other on data from clinical medicine. We derived the four central moments of the target statistics, together with the bootstrap method to evaluate the parameters of the Pearson distribution. The fitted Pearson curves of Types I and II were recommended based on the available data. The R code is also provided so it can be readily used by readers.
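With SNR taken as the ratio of mean to standard deviation, a minimal bootstrap sketch for the difference between two correlated SNRs, resampling subjects jointly as the validation step above does, is given below on simulated paired readings; it illustrates the bootstrap ingredient only, not the Pearson-curve approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated paired readings from two measurement methods on the same subjects
n = 200
method_a = rng.normal(10.0, 2.0, n)
method_b = method_a + rng.normal(0.0, 1.0, n)      # correlated with method_a

snr = lambda x: x.mean() / x.std(ddof=1)

diffs = np.empty(5000)
for b in range(5000):
    idx = rng.integers(0, n, n)                    # resample subjects jointly
    diffs[b] = snr(method_a[idx]) - snr(method_b[idx])

print("95% bootstrap CI for SNR difference:", np.quantile(diffs, [0.025, 0.975]))
```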
In this paper, the Automated Actuarial Loss Reserving Model is developed and extended using machine learning. Traditional actuarial reserving techniques are no longer compatible with the current pace of technological advancement. In response, an alternative Artificial Intelligence Based Automated Actuarial Loss Reserving Methodology is implemented that captures diverse risk profiles for various policyholders by combining the Micro Finance, Auto Insurance, and Both Services lines of business on the same platform through the computation of the Comprehensive Automated Actuarial Loss Reserves (CAALR). Introducing four further types of actuarial loss reserves to those existing in the actuarial literature appears to significantly reduce lapse rates, reinsurance costs, and expenses and outgo. As a consequence, this helps to bring together a combination of new and existing policyholders in the insurance company. The frequency-severity models are extended using ten machine learning algorithms, ultimately leading to the proposed machine learning-based actuarial loss reserving model, which performed remarkably well compared to the traditional chain ladder reserving method on simulated data.
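Since the traditional chain-ladder method serves as the benchmark, a minimal sketch of its volume-weighted development factors on a toy cumulative run-off triangle may help; the triangle values are illustrative only and unrelated to the simulated data used in the paper.

```python
import numpy as np

# Toy cumulative claims triangle: rows = accident years, columns = development years
tri = np.array([
    [1000, 1500, 1700, 1750],
    [1100, 1650, 1870, np.nan],
    [1200, 1800, np.nan, np.nan],
    [1300, np.nan, np.nan, np.nan],
])

n = tri.shape[1]
factors = []
for j in range(n - 1):
    known = ~np.isnan(tri[:, j + 1])
    # Volume-weighted development factor from column j to j + 1
    factors.append(tri[known, j + 1].sum() / tri[known, j].sum())

# Project each accident year to ultimate
ultimates = []
for i, row in enumerate(tri):
    last = n - 1 - i                       # last known development column
    ult = row[last]
    for j in range(last, n - 1):
        ult *= factors[j]
    ultimates.append(ult)

# Reserve = projected ultimates minus the latest known diagonal (paid to date)
paid_to_date = tri[np.arange(4), [3, 2, 1, 0]].sum()
reserve = sum(ultimates) - paid_to_date
print("development factors:", np.round(factors, 3), " total reserve:", round(reserve, 1))
```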
Sample size determination typically relies on a power analysis based on a frequentist conditional approach. The latter can be seen as a particular case of the two-priors approach, which allows one to build four distinct power functions to select the optimal sample size. We revise this approach when the focus is on testing a single binomial proportion. We consider exact methods and introduce a conservative criterion to account for the typical non-monotonic behavior of the power functions when dealing with discrete data. The main purpose of this paper is to present a Shiny App providing a user-friendly, interactive tool to apply these criteria. The app also provides specific tools to elicit the analysis and design prior distributions, which are the core of the two-priors approach.
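As a sketch of the exact frequentist-conditional ingredient of this calculation, namely the power of a one-sided exact binomial test of H0: p = p0 against p = p1 at sample size n, one can enumerate the binomial tail directly; the conservative criterion and the two-priors layers described above are not reproduced here.

```python
from scipy.stats import binom

def exact_power(n, p0, p1, alpha=0.05):
    """Power of the one-sided exact binomial test (reject for large counts)."""
    # Smallest critical value whose size does not exceed alpha under p0
    k_crit = next(k for k in range(n + 1) if binom.sf(k - 1, n, p0) <= alpha)
    return binom.sf(k_crit - 1, n, p1)            # rejection probability under p1

# Power is typically non-monotone in n for discrete data, which motivates
# the conservative criterion mentioned above
for n in range(40, 56, 5):
    print(n, round(exact_power(n, p0=0.3, p1=0.5), 3))
```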
For more than a century, forecasting models have been crucial in a variety of fields. Models can offer the most accurate forecasting outcomes if error terms are normally distributed. Finding a good statistical model for time series forecasting of imports in Malaysia is the main target of this study. The study mainly addresses the unrestricted error correction model (UECM) and a composite model (combined regression-ARIMA). Quarterly time series data on Malaysia’s imports from the first quarter of 1991 to the third quarter of 2022 are employed. The forecasting outcomes of the current study demonstrated that the composite model offered richer probabilistic information, which improved forecasts of the volume of Malaysia’s imports. Both the composite model and the UECM in this study are linear models of Malaysia’s imports. Future studies might compare the performance of linear and nonlinear models in forecasting.
It is acknowledged today within the scientific community that two types of actions must be considered to limit global warming: mitigation actions that reduce GHG emissions in order to contain the rate of global warming, and adaptation actions that adapt societies to climate change in order to limit losses and damages [1] [2]. As far as adaptation actions are concerned, numerical simulation, owing to the quality of its results, its lower cost compared with tests carried out on complex mechanical structures, and its ease of implementation, appears to be a major step in the design and prediction of complex systems. However, despite the quality of the results obtained, biases and inaccuracies related to the structure of the models do exist. Therefore, there is a need to validate the results of this SARIMA-LSTM digital learning model, adjusted by a “calculating-test” matching approach, in order to assess the quality of the results and the performance of the model. The methodology consists of exploiting two climatic databases (temperature and precipitation), one in-situ and the other spatial, both derived from grid points. Data from the grid points are processed and stored in specific formats and, through machine learning approaches, complex mathematical equations are worked out and interconnections within the climate system are established. Through this mathematical approach, it is possible to predict the future climate of the Sudano-Sahelian zone of Cameroon and to propose adaptation strategies.
This paper considers the compound Poisson risk model perturbed by Brownian motion with variable premium and dependence between claim amounts and inter-claim times via the Spearman copula. It is assumed that the insurance company’s portfolio is governed by two classes of policyholders: on the one hand, a first class where the amount of claims is high, and on the other hand, a second class where the amount of claims is low. This difference in claim amounts has significant implications for the insurance company’s pricing and risk management strategies. When policyholders are in the first class, they pay an insurance premium of a constant amount c1, and when they are in the second class, the premium paid is a constant amount c2 such that c1 > c2. The nature of claims (low or high) is measured via random thresholds. The study in this work focuses on the determination of the integro-differential equations satisfied by the Gerber-Shiu functions and their Laplace transforms in the risk model perturbed by Brownian motion with variable premium and dependence between claim amounts and inter-claim times via the Spearman copula.
The Automated Actuarial Pricing and Underwriting Model has been enhanced and expanded through the implementation of Artificial Intelligence to automate three distinct actuarial functions: loss reserving, pricing, and underwriting. This model utilizes data analytics based on Artificial Intelligence to merge microfinance and car insurance services. Introducing and applying a no-claims bonus rate system, comprising base rates, variable rates, and final rates, to three key policyholder categories significantly reduces the occurrence and impact of claims while encouraging increased premium payments. We have enhanced frequency-severity models with eight machine learning algorithms and adjusted the Automated Actuarial Pricing and Underwriting Model for inflation, resulting in outstanding performance. Among the machine learning models utilized, the Random Forest (RANGER) achieved the highest Total Aggregate Comprehensive Automated Actuarial Loss Reserve Risk Pricing Balance (ACAALRRPB), establishing itself as the preferred model for developing Automated Actuarial Underwriting models tailored to specific policyholder categories.
Interrater reliability (IRR) statistics, like Cohen’s kappa, measure agreement between raters beyond what is expected by chance when classifying items into categories. While Cohen’s kappa has been widely used, it has several limitations, prompting development of Gwet’s agreement statistic, an alternative “kappa” statistic which models chance agreement via an “occasional guessing” model. However, we show that Gwet’s formula for estimating the proportion of agreement due to chance is itself biased for intermediate levels of agreement, despite overcoming limitations of Cohen’s kappa at high and low agreement levels. We derive a maximum likelihood estimator for the occasional guessing model that yields an unbiased estimator of the IRR, which we call the maximum likelihood kappa (κML). The key result is that the chance agreement probability under the occasional guessing model is simply equal to the observed rate of disagreement between raters. The κML statistic provides a theoretically principled approach to quantifying IRR that addresses limitations of previous κ coefficients. Given the widespread use of IRR measures, having an unbiased estimator is important for reliable inference across domains where rater judgments are analyzed.
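Taking the stated key result at face value, that the chance-agreement probability equals the observed disagreement rate (pc = 1 - po), and plugging it into the usual kappa form κ = (po - pc)/(1 - pc) gives κML = (2po - 1)/po; a tiny sketch under that reading of the abstract is:

```python
def kappa_ml(p_observed_agreement):
    """Maximum likelihood kappa under the occasional-guessing model,
    assuming chance agreement pc equals the observed disagreement rate."""
    po = p_observed_agreement
    pc = 1.0 - po                    # key result stated in the abstract
    return (po - pc) / (1.0 - pc)    # simplifies to (2*po - 1) / po

for po in (0.6, 0.8, 0.95):
    print(po, round(kappa_ml(po), 3))
```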
This study proposes a novel approach for estimating automobile insurance loss reserves utilizing Artificial Neural Network (ANN) techniques integrated with actuarial data intelligence. The model aims to address the challenges of accurately predicting insurance claim frequencies, severities, and overall loss reserves while accounting for inflation adjustments. Through comprehensive data analysis and model development, this research explores the effectiveness of ANN methodologies in capturing complex nonlinear relationships within insurance data. The study leverages a data set comprising automobile insurance policyholder information, claim history, and economic indicators to train and validate the ANN-based reserving model. Key aspects of the methodology include data preprocessing techniques such as one-hot encoding and scaling, followed by the construction of frequency, severity, and overall loss reserving models using ANN architectures. Moreover, the model incorporates inflation adjustment factors to ensure the accurate estimation of future loss reserves in real terms. Results from the study demonstrate the superior predictive performance of the ANN-based reserving model compared to traditional actuarial methods, with substantial improvements in accuracy and robustness. Furthermore, the model’s ability to adapt to changing market conditions and regulatory requirements, such as IFRS17, highlights its practical relevance in the insurance industry. The findings of this research contribute to the advancement of actuarial science and provide valuable insights for insurance companies seeking more accurate and efficient loss reserving techniques. The proposed ANN-based approach offers a promising avenue for enhancing risk management practices and optimizing financial decision-making processes in the automobile insurance sector.
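A bare-bones version of the preprocessing-plus-network pipeline described above (one-hot encoding of categorical policy features, scaling of numeric ones, then a small feed-forward network), written here with scikit-learn rather than the authors' exact framework and on made-up policy data, might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "vehicle_class": rng.choice(["sedan", "suv", "truck"], n),   # categorical feature
    "driver_age": rng.integers(18, 80, n),                       # numeric feature
    "exposure": rng.uniform(0.1, 1.0, n),                        # numeric feature
})
claim_cost = 200 * df["exposure"] + 5 * (70 - df["driver_age"]).clip(0) + rng.gamma(2, 50, n)

pre = ColumnTransformer([
    ("cat", OneHotEncoder(), ["vehicle_class"]),
    ("num", StandardScaler(), ["driver_age", "exposure"]),
])
model = Pipeline([("pre", pre),
                  ("ann", MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000,
                                       random_state=0))])

X_tr, X_te, y_tr, y_te = train_test_split(df, claim_cost, random_state=0)
model.fit(X_tr, y_tr)
print("R^2 on held-out policies:", round(model.score(X_te, y_te), 2))
```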