The problem of missing values has long been studied by researchers working in data science and bioinformatics, especially in the analysis of gene expression data that facilitates early detection of cancer. Many attempts show improvements made by excluding samples with missing information from the analysis process, while others have tried to fill the gaps with plausible values. While the former is simple, the latter safeguards against information loss. For that, a neighbour-based (KNN) approach has proven more effective than other global estimators. This paper extends the approach further by introducing a new summarization method to the KNN model. It is the first study to apply the concept of the ordered weighted averaging (OWA) operator to this problem context. In particular, two variations of OWA aggregation are proposed and evaluated against their baseline and other neighbour-based models. Using missing-value ratios from 1% to 20% and a set of six published gene expression datasets, the experimental results suggest that the new methods usually provide more accurate estimates than the compared methods. At missing rates of 5% and 20%, the best NRMSE scores averaged across datasets are 0.65 and 0.69, while the best measures obtained by existing techniques included in this study are 0.80 and 0.84, respectively.
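The abstract does not spell out the two OWA variants, but the core idea of OWA-weighted KNN imputation can be sketched: find the k nearest donor rows using distances over commonly observed features, sort the donors' candidate values, and aggregate them with rank-based weights. The decaying weight scheme below is an illustrative choice, not the paper's:

```python
import math

def owa_weights(k, alpha=0.5):
    # Decaying OWA weights over value ranks (hypothetical choice;
    # the paper's two weighting schemes are not given in the abstract).
    raw = [(k - i) ** alpha for i in range(k)]
    s = sum(raw)
    return [w / s for w in raw]

def knn_owa_impute(rows, target, feature, k=3):
    """Estimate rows[target][feature] (None) from the k nearest donor rows.
    Assumes every pair of rows shares at least one observed feature."""
    def dist(a, b):
        # mean squared difference over features observed in both rows
        d = [(x - y) ** 2 for x, y in zip(a, b)
             if x is not None and y is not None]
        return math.sqrt(sum(d) / len(d))

    donors = [r for i, r in enumerate(rows)
              if i != target and r[feature] is not None]
    donors.sort(key=lambda r: dist(rows[target], r))
    # OWA orders the argument values before weighting them
    values = sorted((r[feature] for r in donors[:k]), reverse=True)
    w = owa_weights(len(values))
    return sum(wi * v for wi, v in zip(w, values))

data = [[1.0, 2.0, 3.0],
        [1.1, 2.1, 2.9],
        [0.9, None, 3.1],
        [5.0, 6.0, 7.0]]
est = knn_owa_impute(data, target=2, feature=1, k=2)
```

With k=2 the two nearest donors contribute 2.0 and 2.1, so the estimate lands between them; a plain (unweighted) KNN mean would give exactly 2.05.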
Most real application processes belong to complex nonlinear systems with incomplete information. It is difficult to estimate a model by assuming that the data set is governed by a single global model, and in real processes the available data set is usually obtained with missing values. To overcome the shortcomings of global modeling and missing data values, a new modeling method is proposed. First, an incomplete data set with missing values is partitioned into several clusters by a K-means with soft constraints (KSC) algorithm, which incorporates soft constraints to enable clustering with missing values. Then a local model for each group is developed using an SVR algorithm that adopts a missing-value-insensitive (MVI) kernel to address the missing value estimation problem. The valid region of each local model is also determined. Simulation results demonstrate the effectiveness of the local models and the estimation algorithm.
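The KSC soft-constraint machinery and the MVI kernel are not detailed in the abstract, but the first step — clustering rows that contain missing entries — can be illustrated with a plain K-means that simply ignores missing dimensions in both the distance computation and the centroid update (a simplified stand-in, not the paper's KSC algorithm):

```python
import random

def column_means(rows):
    # per-column mean over observed (non-None) entries
    means = []
    for j in range(len(rows[0])):
        obs = [r[j] for r in rows if r[j] is not None]
        means.append(sum(obs) / len(obs))
    return means

def kmeans_missing(rows, k, iters=10, seed=0):
    """K-means over rows with None gaps: distances use only observed
    dimensions, centroids average only observed values."""
    rng = random.Random(seed)
    means = column_means(rows)
    # seed centroids from k random rows, filling gaps with column means
    centroids = [[v if v is not None else means[j]
                  for j, v in enumerate(r)]
                 for r in rng.sample(rows, k)]
    for _ in range(iters):
        # assignment: nearest centroid under the partial distance
        labels = []
        for r in rows:
            d = []
            for c in centroids:
                terms = [(x - c[j]) ** 2
                         for j, x in enumerate(r) if x is not None]
                d.append(sum(terms) / len(terms))
            labels.append(d.index(min(d)))
        # update: centroid coordinate = mean of observed cluster values
        for ci in range(k):
            members = [r for r, l in zip(rows, labels) if l == ci]
            if not members:
                continue  # keep the old centroid for an empty cluster
            for j in range(len(means)):
                obs = [r[j] for r in members if r[j] is not None]
                if obs:
                    centroids[ci][j] = sum(obs) / len(obs)
    return labels, centroids

data = [[0.0, 0.1], [0.1, None], [5.0, 5.1], [None, 5.0]]
labels, cents = kmeans_missing(data, k=2)
```

On this toy data the two incomplete rows are pulled into the clusters of their complete neighbours, which is exactly the partitioning a local-model step would then build on.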
The frequent missing values in radar-derived time-series tracks of aerial targets (RTT-AT) lead to significant challenges in subsequent data-driven tasks. However, the majority of imputation research focuses on random missing (RM), which differs significantly from the common missing patterns of RTT-AT, and methods designed for RM may degrade or fail when applied to RTT-AT imputation. Conventional autoregressive deep learning methods are prone to error accumulation and long-term dependency loss. In this paper, a non-autoregressive imputation model is proposed that addresses missing value imputation for two common missing patterns in RTT-AT. The model consists of two probabilistic sparse diagonal masking self-attention (PSDMSA) units and a weight fusion unit. It learns missing values by combining the representations output by the two units, aiming to minimize the difference between the imputed and actual values. The PSDMSA units effectively capture temporal dependencies and attribute correlations between time steps, improving imputation quality. The weight fusion unit automatically updates the weights of the output representations from the two units to obtain a more accurate final representation. The experimental results indicate that, across varying missing rates in the two missing patterns, the model consistently outperforms other methods in imputation performance and exhibits a low frequency of deviations in estimates for specific missing entries. Compared to the state-of-the-art autoregressive deep learning imputation model Bidirectional Recurrent Imputation for Time Series (BRITS), the proposed model reduces mean absolute error (MAE) by 31% to 50%. Additionally, the model trains 4 to 8 times faster than both BRITS and a standard Transformer on the same dataset. Finally, the ablation experiments demonstrate that the PSDMSA units, the weight fusion unit, the cascade network design, and the imputation loss each enhance imputation performance, confirming the efficacy of the design.
Recently, the importance of data analysis has increased significantly due to rapid data growth. In particular, vehicle communication data, a significant challenge in Intelligent Transportation Systems (ITS), has spatiotemporal characteristics and many missing values, and high rates of missing values degrade the predictive performance of models. Existing missing value imputation models ignore the topology of transportation networks: road segments that appear physically close in spatiotemporal image data are in fact structurally connected through the road network. Additionally, the learning process of missing value imputation models requires complete data, but there are limitations in securing complete vehicle communication data. This study proposes a missing value imputation model based on an adversarial autoencoder with spatiotemporal feature extraction to address these issues. The proposed method replaces missing values by reflecting the spatiotemporal characteristics of transportation data through temporal and spatial convolutions. Experimental results show that the proposed model achieves the lowest error rate of 5.92%, demonstrating excellent predictive accuracy. This helps solve the data sparsity problem and can improve traffic safety through superior predictive performance.
Chronic kidney disease (CKD) is a major health concern today, requiring early and accurate diagnosis. Machine learning has emerged as a powerful tool for disease detection, and medical professionals are increasingly using ML classifier algorithms to identify CKD early. This study explores the application of advanced machine learning techniques on a CKD dataset obtained from the UC Irvine Machine Learning Repository. The research introduces TrioNet, an ensemble model combining extreme gradient boosting, random forest, and extra trees classifiers, which provides highly accurate predictions for CKD. Furthermore, a K-nearest-neighbour (KNN) imputer is used to deal with missing values, while synthetic minority oversampling (SMOTE) addresses class imbalance. To ascertain the efficacy of the proposed model, a comprehensive comparative analysis is conducted against various machine learning models. The proposed TrioNet with the KNN imputer and SMOTE outperformed the other models, reaching 98.97% accuracy for detecting CKD. This in-depth analysis demonstrates the model's capabilities and underscores its potential as a valuable tool in the diagnosis of CKD.
Time series analysis is a key technology for medical diagnosis, weather forecasting, and financial prediction systems. However, missing data frequently occur during data recording, posing a great challenge to data mining tasks. In this study, we propose a novel time series data representation-based denoising autoencoder (DAE) for the reconstruction of missing values. Two data representation methods, the recurrence plot (RP) and the Gramian angular field (GAF), are used to transform the raw time series into a 2D matrix, establishing the temporal correlations between different time intervals and extracting structural patterns from the time series. An improved DAE is then proposed to reconstruct the missing values from the 2D representation. A comprehensive comparison is conducted among the different representations on standard datasets. Results show that the 2D representations have a lower reconstruction error than the raw time series, with the RP representation providing the best outcome. This work provides useful insights into the reconstruction of missing values in time series analysis and can considerably improve the reliability of time-varying systems.
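Both 2D encodings named above have standard closed forms, which can be sketched directly. The RP thresholds pairwise differences; the GAF rescales the series to [-1, 1], maps values to angles, and takes cosines of angle sums. The threshold value below is an illustrative parameter:

```python
import math

def recurrence_plot(x, eps=0.2):
    # R[i][j] = 1 when |x_i - x_j| <= eps, else 0 (binary recurrence matrix)
    return [[1 if abs(a - b) <= eps else 0 for b in x] for a in x]

def gramian_angular_field(x):
    # rescale to [-1, 1], map each value to an angle phi = arccos(v),
    # then G[i][j] = cos(phi_i + phi_j)
    lo, hi = min(x), max(x)
    scaled = [2 * (v - lo) / (hi - lo) - 1 for v in x]
    phi = [math.acos(v) for v in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]

x = [0.0, 0.5, 1.0, 0.5]
rp = recurrence_plot(x, eps=0.2)
gaf = gramian_angular_field(x)
```

Either matrix can then be fed to a 2D autoencoder; the diagonal of the GAF encodes the (rescaled) series itself, while off-diagonal cells encode pairwise temporal correlations.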
Cardiac diseases are one of the greatest global health challenges. Due to the high annual mortality rates, cardiac diseases have attracted the attention of numerous researchers in recent years. This article proposes a hybrid fuzzy fusion classification model for cardiac arrhythmia diseases. The fusion model is used to optimally select the highest-ranked features generated by a variety of well-known feature-selection algorithms, and an ensemble of classifiers is then applied to the fusion's results. The proposed model classifies the arrhythmia dataset from the University of California, Irvine into normal/abnormal classes as well as 16 classes of arrhythmia. In the preprocessing step, attributes with missing values were filled using the average value of the linear attributes within the same class and the most frequent value for nominal attributes. To keep the model optimal, we also eliminated all attributes with zero or constant values that might bias the results of the classifiers; preprocessing left 161 of the original 279 attributes (features). Thereafter, a fuzzy-based feature-selection fusion method is applied to fuse the high-ranked features obtained from different heuristic feature-selection algorithms. In short, the study comprises three main blocks: (1) sensing data and preprocessing; (2) feature queuing, selection, and extraction; and (3) the predictive model. The proposed method improves classification performance in terms of accuracy, F1 measure, recall, and precision compared to state-of-the-art techniques, achieving 98.5% accuracy in binary class mode and 98.9% accuracy in categorized class mode.
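The preprocessing rule just described — per-class mean for numeric attributes, most frequent value for nominal ones, and removal of constant attributes — can be sketched as follows. Whether the nominal mode is taken per class or globally is not stated in the abstract; the global mode is assumed here:

```python
from collections import Counter

def impute_by_class(rows, labels, numeric_cols, nominal_cols):
    """Fill numeric gaps (None) with the per-class mean and nominal gaps
    with the overall most frequent value. Assumes each class has at least
    one observed value per numeric column it needs to fill."""
    filled = [list(r) for r in rows]
    for j in numeric_cols:
        for c in set(labels):
            obs = [r[j] for r, l in zip(rows, labels)
                   if l == c and r[j] is not None]
            if not obs:
                continue
            mean = sum(obs) / len(obs)
            for r, l in zip(filled, labels):
                if l == c and r[j] is None:
                    r[j] = mean
    for j in nominal_cols:
        obs = [r[j] for r in rows if r[j] is not None]
        mode = Counter(obs).most_common(1)[0][0]
        for r in filled:
            if r[j] is None:
                r[j] = mode
    return filled

def drop_constant_columns(rows):
    # discard attributes with a single distinct value (zero variance)
    keep = [j for j in range(len(rows[0]))
            if len({r[j] for r in rows}) > 1]
    return [[r[j] for j in keep] for r in rows], keep

rows = [[1.0, 'a'], [None, 'a'], [3.0, 'b'], [5.0, None]]
filled = impute_by_class(rows, [0, 0, 1, 1], [0], [1])
_, keep = drop_constant_columns([[1, 7], [2, 7]])
```

Applied to the arrhythmia data, the constant-column filter is what reduced 279 attributes to 161 before feature-selection fusion.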
Data with missing values, or incomplete information, brings challenges to the development of classification, as the incompleteness may significantly affect the performance of classifiers. In this paper, we handle missing values in both training and test sets with uncertainty and imprecision reasoning by proposing a new belief combination of classifiers (BCC) method based on evidence theory. The proposed BCC method aims to improve the classification performance on incomplete data by characterizing the uncertainty and imprecision brought by incompleteness. In BCC, different attributes are regarded as independent sources, and the collection of each attribute is considered a subset. Multiple classifiers are then trained on each subset independently, allowing each observed attribute to provide a sub-classification result for the query pattern. Finally, these sub-classification results, weighted by discounting factors, provide supplementary information to jointly determine the final classes of the query patterns. The weights have two aspects: a global weight, calculated by an optimization function, represents the reliability of each classifier, while a local weight, obtained by mining attribute distribution characteristics, quantifies the importance of the observed attributes to the pattern classification. Extensive comparative experiments with seven methods on twelve datasets demonstrate that BCC outperforms all baseline methods in terms of accuracy, precision, recall, and F1 measure, with reasonable computational costs.
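BCC's full pipeline (per-attribute classifiers, optimized global weights, mined local weights) is beyond an abstract-level sketch, but its evidential core — discounting a source's belief masses by a reliability factor and fusing the discounted masses with Dempster's rule — can be illustrated. The class names, mass values, and reliability factors below are illustrative; `'omega'` stands for the whole frame of discernment (total ignorance):

```python
def discount(mass, alpha):
    """Shafer discounting: scale each focal mass by reliability alpha and
    move the remaining 1 - alpha onto the frame 'omega'."""
    out = {a: alpha * m for a, m in mass.items() if a != 'omega'}
    out['omega'] = 1 - alpha + alpha * mass.get('omega', 0.0)
    return out

def combine(m1, m2, frame):
    """Dempster's rule over singleton hypotheses plus 'omega'."""
    joint = {h: 0.0 for h in frame}
    joint['omega'] = 0.0
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            if a == 'omega' and b == 'omega':
                joint['omega'] += ma * mb
            elif a == 'omega':
                joint[b] += ma * mb      # omega refines to the other focal set
            elif b == 'omega':
                joint[a] += ma * mb
            elif a == b:
                joint[a] += ma * mb      # agreeing singletons reinforce
            else:
                conflict += ma * mb      # disagreeing singletons conflict
    return {h: v / (1 - conflict) for h, v in joint.items()}

# two sub-classifiers, one more reliable (alpha=0.8) than the other (0.5)
m1 = discount({'A': 0.9, 'B': 0.1}, 0.8)
m2 = discount({'A': 0.6, 'B': 0.4}, 0.5)
fused = combine(m1, m2, ['A', 'B'])
```

The fused mass concentrates on class 'A', with the less reliable source contributing proportionally less evidence — the same mechanism BCC uses to let unreliable attribute classifiers down-weight themselves.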
The accuracy of a statistical learning model depends on the learning technique used, which in turn depends on the dataset's values. In most research studies, the existence of missing values (MVs) is a vital problem, and any dataset with MVs cannot be used for further analysis or with data-driven tools, especially when the percentage of MVs is high. In this paper, the authors propose a novel algorithm for dealing with MVs based on feature selection (FS) using a similarity classifier with a fuzzy entropy measure. The proposed algorithm imputes MVs in cumulative order. The candidate feature to be processed is selected using the similarity classifier with Parkash's fuzzy entropy measure, and the predictive model for the MVs within the candidate feature is the Bayesian Ridge Regression (BRR) technique. Furthermore, each imputed feature is incorporated into the BRR equation to impute the MVs in the next chosen incomplete feature. The proposed algorithm was compared against several practical state-of-the-art imputation methods in an experiment on four medical datasets gathered from several database repositories, with MVs generated under the three missingness mechanisms. Mean absolute error (MAE), root mean square error (RMSE), and the coefficient of determination (R2 score) were used to measure performance. The results show that performance varies depending on the size of the dataset, the amount of MVs, and the missingness mechanism type. Moreover, compared with the other methods, the proposed method gives better accuracy and lower error in most cases.
Metabolomics, as a research field and a set of techniques, studies the entire complement of small molecules in biological samples. It is emerging as a powerful tool for precision medicine; in particular, integration of the microbiome and the metabolome has revealed mechanisms and functions of the microbiome in human health and disease. However, metabolomics data are very complicated, and preprocessing, pretreatment, and normalization procedures are usually required before statistical analysis. In this review article, we comprehensively review the methods used to preprocess and pretreat metabolomics data, including MS-based and NMR-based data preprocessing, handling of zero and/or missing values, outlier detection, data normalization, data centering and scaling, and data transformation, and we discuss the advantages and limitations of each method. The choice of a suitable preprocessing method is determined by the biological hypothesis, the characteristics of the data set, and the selected statistical analysis method. We then provide a perspective on their applications in microbiome and metabolome research.
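Three of the pretreatment operations listed — centering and scaling, and transformation — have simple standard formulas worth making concrete. Autoscaling and Pareto scaling are the two most common scaling variants in metabolomics; the log offset below is a common convenience to guard against zeros, not a universal convention:

```python
import math

def autoscale(x):
    # mean-centre and divide by the sample standard deviation (unit variance)
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / (len(x) - 1))
    return [(v - mu) / sd for v in x]

def pareto_scale(x):
    # mean-centre and divide by the square root of the standard deviation:
    # shrinks large fold changes less aggressively than autoscaling
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / (len(x) - 1))
    return [(v - mu) / math.sqrt(sd) for v in x]

def log_transform(x, offset=1.0):
    # log transformation to reduce heteroscedasticity and skew
    return [math.log(v + offset) for v in x]

x = [1.0, 2.0, 3.0]
a = autoscale(x)
p = pareto_scale(x)
```

As the review stresses, the right choice among these depends on the biological hypothesis and the downstream statistical method, not on any single default.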
Complete and reliable field traffic data is vital for the planning, design, and operation of urban traffic management systems. However, traffic data is often very incomplete in many traffic information systems, which hinders effective use of the data, so methods are needed for imputing missing traffic data to minimize the effect of incompleteness. This paper presents an improved Local Least Squares (LLS) approach to impute incomplete data; LLS is an improved version of the K Nearest Neighbor (KNN) method. First, the missing traffic data are replaced by a row average of the known values. Then, the vector angle and Euclidean distance are used to select the nearest neighbors. Finally, a regression step is used to obtain the weights of the nearest neighbors and the imputation results. Traffic flow volume collected in Beijing was analyzed to compare this approach with the Bayesian Principal Component Analysis (BPCA) imputation approach. Tests show that this approach performs slightly better than BPCA imputation for missing traffic data.
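The three steps just listed can be sketched in a deliberately simplified form: row-average initialization, neighbour selection by vector angle, and a one-neighbour least-squares fit in place of the full multi-neighbour regression (the paper combines several neighbours and also uses Euclidean distance; this sketch keeps only the angle criterion):

```python
import math

def row_average_fill(row):
    # step 1: replace None entries with the row average of known values
    obs = [v for v in row if v is not None]
    avg = sum(obs) / len(obs)
    return [v if v is not None else avg for v in row]

def cosine(a, b):
    # similarity used for the vector-angle neighbour selection
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return num / (na * nb)

def lls_impute_one(rows, target, col):
    """Impute rows[target][col] (None) with a one-neighbour least-squares
    fit: a stripped-down version of the LLS recipe."""
    filled = [row_average_fill(r) for r in rows]
    query = filled[target]
    # step 2: pick the donor row with the smallest angle to the query
    cands = [i for i in range(len(rows))
             if i != target and rows[i][col] is not None]
    best = max(cands, key=lambda i: cosine(filled[i], query))
    # step 3: least-squares weight w minimizing ||query - w * donor||
    # over the remaining columns, then predict w * donor[col]
    xs = [filled[best][j] for j in range(len(query)) if j != col]
    ys = [query[j] for j in range(len(query)) if j != col]
    w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return w * filled[best][col]

rows = [[1.0, 2.0, None],
        [2.0, 4.0, 6.0],
        [10.0, 1.0, 1.0]]
est = lls_impute_one(rows, target=0, col=2)
```

Here the query row is exactly half the selected neighbour, so the fitted weight is 0.5 and the imputed value is 3.0, illustrating why a regression weight can beat a plain neighbour average.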
Missing value imputation with crowdsourcing is a novel method in data cleaning for capturing missing values that can hardly be filled by automatic approaches. However, the time cost and overhead of crowdsourcing are high, so we have to reduce cost while guaranteeing the accuracy of crowdsourced imputation. To achieve this optimization goal, we present COSSET+, a crowdsourced framework optimized by a knowledge base. We combine the advantages of a knowledge-based filter and a crowdsourcing platform to capture missing values. Since the number of crowdsourced values affects the cost of COSSET+, we aim to select only part of the missing values to be crowdsourced. We prove that the crowd value selection problem is NP-hard and develop an approximation algorithm for it. Extensive experimental results demonstrate the efficiency and effectiveness of the proposed approaches.
Missing values occur in bio-signal processing for various reasons, including technical problems or biological characteristics. These missing values are then either simply excluded or substituted with estimated values for further processing. When missing signal values are estimated for electroencephalography (EEG) signals, where electrical signals arrive quickly and successively, rapid processing of high-speed data is required for immediate decision making. In this study, we propose an incremental expectation maximization principal component analysis (iEMPCA) method that automatically estimates missing values from multivariate EEG time series data without requiring a whole and complete data set. The proposed method solves the problem of a biased model, which inevitably results from simply removing incomplete data rather than estimating it, and thus reduces information loss by incorporating missing values in real time. By using an incremental approach, the proposed method also minimizes the memory usage and processing time of continuously arriving data. Experimental results show that the proposed method assigns more accurate missing values than previous methods.
Multivariate time series with missing values are common in a wide range of applications, including energy data. Existing imputation methods often fail to capture the temporal dynamics and the cross-dimensional correlation simultaneously. In this paper we propose a two-step method based on an attention model to impute missing values in multivariate energy time series. First, the underlying distribution of the missing values in the data is learned. This information is then used to train an attention-based imputation model. By learning the distribution prior to the imputation process, the model can respond flexibly to the specific characteristics of the underlying data. The developed model is applied to European energy data obtained from the European Network of Transmission System Operators for Electricity. Using different evaluation metrics and benchmarks, the conducted experiments show that the proposed model is preferable to the benchmarks and is able to accurately impute missing values.
This paper investigates the characteristics of a clinical dataset using a combination of feature selection and classification methods to handle missing values and understand the underlying statistical characteristics of a typical clinical dataset. Typically, a large clinical dataset presents challenges such as missing values, high dimensionality, and unbalanced classes, which pose inherent problems when implementing feature selection and classification algorithms. With most clinical datasets, an initial exploration of the dataset is carried out, and attributes with more than a certain percentage of missing values are eliminated. Later, with the help of missing value imputation, feature selection, and classification algorithms, prognostic and diagnostic models are developed. This paper has two main conclusions: 1) Despite the nature and large size of clinical datasets, the choice of missing value imputation method does not affect the final performance; what is crucial is that the dataset accurately represents the clinical problem, so the imputation method is not critical for developing classifiers and prognostic/diagnostic models. 2) Supervised learning has proven more suitable for mining clinical data than unsupervised methods, and non-parametric classifiers such as decision trees give better results than parametric classifiers such as radial basis function networks (RBFNs).
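The initial exploration step described above — eliminating attributes whose missing-value percentage exceeds some threshold — is a one-liner worth writing down. The 40% cut-off below is an illustrative choice, not a value taken from the paper:

```python
def drop_sparse_attributes(rows, max_missing_ratio=0.4):
    """Remove attributes whose fraction of missing entries (None) exceeds
    the threshold, keeping the original column indices for traceability."""
    n = len(rows)
    keep = [j for j in range(len(rows[0]))
            if sum(r[j] is None for r in rows) / n <= max_missing_ratio]
    return [[r[j] for j in keep] for r in rows], keep

rows = [[1.0, None, 3.0],
        [None, None, 4.0],
        [2.0, None, 5.0]]
kept_rows, keep = drop_sparse_attributes(rows)
```

Returning the surviving column indices alongside the reduced data makes it easy to map model coefficients back to the original clinical attributes.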
Funding: This work is funded by Newton Institutional Links 2020-21 project 623718881, jointly by the British Council (www.britishcouncil.org) and the National Research Council of Thailand. The corresponding author is the project PI.
Funding: This work is supported by the Key Discipline Construction Program of the Beijing Municipal Commission of Education (XK10008043).
Funding: This work is supported by Graduate Funded Project (No. JY2022A017).
Funding: This work is supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-0-01405), supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).
Funding: This work is funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project Number PNURSP2024R333, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Funding: Supported in part by the Center-initiated Research Project and Research Initiation Project of Zhejiang Laboratory (113012-AL2201, 113012-PI2103), the National Natural Science Foundation of China (61300167, 61976120), the Natural Science Foundation of Jiangsu Province (BK20191445), the Natural Science Key Foundation of the Jiangsu Education Department (21KJA510004), and the Qing Lan Project of Jiangsu Province.
Abstract: Data with missing values, or incomplete information, bring challenges to classification, as the incompleteness may significantly degrade the performance of classifiers. In this paper, we handle missing values in both training and test sets through uncertainty and imprecision reasoning, proposing a new belief combination of classifiers (BCC) method based on evidence theory. The proposed BCC method aims to improve the classification of incomplete data by characterizing the uncertainty and imprecision brought by incompleteness. In BCC, different attributes are regarded as independent sources, and the values of each attribute are considered a subset. Multiple classifiers are then trained on each subset independently, so that each observed attribute provides a sub-classification result for the query pattern. Finally, these sub-classification results are combined with different weights (discounting factors) to jointly determine the final classes of the query patterns. The weights have two aspects: a global weight, calculated by an optimization function, represents the reliability of each classifier, while a local weight, obtained by mining attribute distribution characteristics, quantifies the importance of the observed attributes to the pattern classification. Extensive comparative experiments covering seven methods on twelve datasets demonstrate that BCC outperforms all baseline methods in terms of accuracy, precision, recall, and F1-measure, at reasonable computational cost.
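The final fusion step can be illustrated with a much-simplified sketch: plain weighted averaging of per-attribute soft outputs stands in here for the paper's evidence-theoretic combination, and all numbers are made up:

```python
import numpy as np

def combine_sub_results(probs, global_w, local_w):
    """Fuse per-attribute sub-classification results with global x local weights.

    probs: (n_attrs, n_classes) soft outputs, one row per observed attribute.
    global_w: per-classifier reliability; local_w: per-attribute importance.
    Unobserved attributes simply contribute no row, mirroring the idea that
    only observed attributes supply evidence for the query pattern.
    """
    w = np.asarray(global_w, dtype=float) * np.asarray(local_w, dtype=float)
    w = w / w.sum()
    fused = (w[:, None] * np.asarray(probs, dtype=float)).sum(axis=0)
    return fused / fused.sum()

fused = combine_sub_results([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]],
                            global_w=[1.0, 0.5, 0.8], local_w=[1.0, 1.0, 0.5])
label = int(np.argmax(fused))
```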
Funding: Funded by the Deanship of Scientific Research (DSR) at King Abdulaziz University (KAU), Jeddah, Saudi Arabia, under grant No. (PH: 13-130-1442).
Abstract: The accuracy of a statistical learning model depends on the learning technique used, which in turn depends on the dataset's values. In most research studies, the existence of missing values (MVs) is a vital problem. Moreover, any dataset with MVs cannot be used for further analysis or with data-driven tools, especially when the percentage of MVs is high. In this paper, the authors propose a novel algorithm for dealing with MVs based on feature selection (FS) using a similarity classifier with a fuzzy entropy measure. The proposed algorithm imputes MVs in cumulative order. The candidate feature to be manipulated is selected using the similarity classifier with Parkash's fuzzy entropy measure, and the predictive model used to estimate MVs within the candidate feature is the Bayesian Ridge Regression (BRR) technique. Furthermore, any imputed features are incorporated within the BRR equation to impute the MVs in the next chosen incomplete feature. The proposed algorithm was compared against several practical state-of-the-art imputation methods in an experiment on four medical datasets, gathered from several database repositories, with MVs generated under the three missingness mechanisms. The evaluation metrics of mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R2 score) were used to measure performance. The results show that performance varies with the size of the dataset, the amount of MVs, and the missingness mechanism type; compared to the other methods, the proposed method gives better accuracy and smaller error in most cases.
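The cumulative idea can be sketched in plain NumPy. Ordinary ridge regression stands in here for Bayesian Ridge Regression, and the fuzzy-entropy feature ranking is replaced by a simple fewest-missing-first order, so this only approximates the algorithm's structure:

```python
import numpy as np

def cumulative_ridge_impute(X, alpha=1.0):
    """Impute columns one at a time, fewest-missing first; each imputed column
    then serves as a predictor for the next (cumulative order).  Assumes at
    least one fully observed column to seed the predictors."""
    X = np.asarray(X, dtype=float).copy()
    order = np.argsort(np.isnan(X).sum(axis=0))
    complete = [j for j in order if not np.isnan(X[:, j]).any()]
    for j in order:
        miss = np.isnan(X[:, j])
        if not miss.any():
            continue
        A = X[:, complete]
        obs = ~miss
        # Ridge solution w = (A'A + alpha I)^-1 A'y, fitted on observed rows.
        G = A[obs].T @ A[obs] + alpha * np.eye(A.shape[1])
        w = np.linalg.solve(G, A[obs].T @ X[obs, j])
        X[miss, j] = A[miss] @ w
        complete.append(j)   # imputed column now helps predict later ones
    return X

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0]])
filled = cumulative_ridge_impute(X, alpha=1e-6)
```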
Funding: Supported by the Crohn's & Colitis Foundation Senior Research Award (No. 902766 to J.S.), the National Institute of Diabetes and Digestive and Kidney Diseases (Nos. R01DK105118-01 and R01DK114126 to J.S.), the United States Department of Defense Congressionally Directed Medical Research Programs (No. BC191198 to J.S.), and VA Merit Award BX-19-00 to J.S.
Abstract: Metabolomics, as a research field and a set of techniques, studies the entire complement of small molecules in biological samples. It is emerging as a powerful tool for precision medicine in general; in particular, the integration of the microbiome and the metabolome has revealed the mechanisms and functionality of the microbiome in human health and disease. However, metabolomics data are very complicated, and preprocessing/pretreatment and normalization procedures are usually required before statistical analysis. In this review article, we comprehensively review the methods used to preprocess and pretreat metabolomics data, including MS-based and NMR-based data preprocessing, handling zero and/or missing values and detecting outliers, data normalization, data centering and scaling, and data transformation, and we discuss the advantages and limitations of each method. The choice of a suitable preprocessing method is determined by the biological hypothesis, the characteristics of the dataset, and the selected statistical analysis method. We then provide a perspective on their applications in microbiome and metabolome research.
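As a small illustration of the centering/scaling step, the following sketch applies one common pretreatment pipeline (log transform followed by autoscaling); it is one of several options such reviews discuss, not a recommendation:

```python
import numpy as np

def pretreat(X, eps=1.0):
    """Common metabolomics pretreatment: log-transform to tame skewed
    intensities, then autoscale (mean-centre and scale each metabolite
    to unit variance)."""
    X = np.log(np.asarray(X, dtype=float) + eps)  # eps guards zero intensities
    X = X - X.mean(axis=0)                         # centering
    return X / X.std(axis=0, ddof=1)               # unit-variance scaling

# Rows = samples, columns = (hypothetical) metabolite intensities.
Z = pretreat([[1.0, 10.0], [3.0, 30.0], [9.0, 90.0]])
```

Alternatives covered by the same family of methods include Pareto scaling (divide by the square root of the standard deviation) and range scaling.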
Funding: Partially supported by the National High-Tech Research and Development (863) Program of China (Nos. 2009AA11Z206 and 2011AA110401), the National Natural Science Foundation of China (Nos. 60721003 and 60834001), and the Tsinghua University Innovation Research Program (No. 2009THZ0).
Abstract: Complete and reliable field traffic data is vital for the planning, design, and operation of urban traffic management systems. However, traffic data is often very incomplete in many traffic information systems, which hinders effective use of the data. Methods are therefore needed for imputing missing traffic data to minimize the effect of incompleteness on data utilization. This paper presents an improved Local Least Squares (LLS) approach to impute incomplete data; the LLS is an improved version of the K Nearest Neighbor (KNN) method. First, the missing traffic data is replaced by the row average of the known values. Then, the vector angle and Euclidean distance are used to select the nearest neighbors. Finally, a regression step is used to obtain the weights of the nearest neighbors and the imputation results. Traffic flow volume collected in Beijing was analyzed to compare this approach with the Bayesian Principal Component Analysis (BPCA) imputation approach. Tests show that this approach performs slightly better than BPCA imputation for missing traffic data.
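The three LLS steps can be sketched as follows; for brevity this sketch selects neighbours by Euclidean distance only, omitting the vector-angle criterion:

```python
import numpy as np

def lls_impute(X, k=2):
    """Local Least Squares imputation sketch: (1) initialise missing entries
    with the row average, (2) pick the k nearest rows by Euclidean distance,
    (3) regress the target row on its neighbours to get imputation weights."""
    X = np.asarray(X, dtype=float).copy()
    mask = np.isnan(X)
    row_mean = np.nanmean(X, axis=1)
    filled = np.where(mask, row_mean[:, None], X)   # step 1: row-average init
    for i in np.where(mask.any(axis=1))[0]:
        d = np.linalg.norm(filled - filled[i], axis=1)
        d[i] = np.inf                               # exclude the row itself
        nbrs = np.argsort(d)[:k]                    # step 2: nearest neighbours
        obs, mis = ~mask[i], mask[i]
        A = filled[np.ix_(nbrs, obs)]               # neighbours on observed cols
        # Step 3: least-squares weights so that w'A approximates the known values.
        w, *_ = np.linalg.lstsq(A.T, X[i, obs], rcond=None)
        X[i, mis] = w @ filled[np.ix_(nbrs, mis)]
    return X

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 2.0, np.nan]])
filled = lls_impute(X, k=2)
```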
Abstract: Missing value imputation with crowdsourcing is a novel data-cleaning method for capturing missing values that can hardly be filled by automatic approaches. However, the time cost and overhead of crowdsourcing are high, so the cost must be reduced while the accuracy of crowdsourced imputation is guaranteed. To achieve this optimization goal, we present COSSET+, a crowdsourced framework optimized by a knowledge base, which combines the advantages of a knowledge-based filter and a crowdsourcing platform to capture missing values. Since the number of crowdsourced values affects the cost of COSSET+, we aim to select only a subset of the missing values to be crowdsourced. We prove that this crowd value selection problem is NP-hard and develop an approximation algorithm for it. Extensive experimental results demonstrate the efficiency and effectiveness of the proposed approaches.
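The abstract does not spell out COSSET+'s approximation algorithm, but budgeted selection problems of this kind are often attacked with a greedy gain/cost heuristic; the sketch below is purely illustrative, and the cell identifiers, gains, and costs are hypothetical:

```python
def select_for_crowd(candidates, budget):
    """Greedy sketch: pick missing values with the best gain/cost ratio until
    the crowdsourcing budget is exhausted.  `candidates` maps a cell id to
    (expected accuracy gain, crowd cost) -- both assumed quantities, since the
    abstract does not describe COSSET+'s actual selection criterion."""
    ranked = sorted(candidates.items(),
                    key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
    chosen, spent = [], 0.0
    for cell, (gain, cost) in ranked:
        if spent + cost <= budget:
            chosen.append(cell)
            spent += cost
    return chosen, spent

chosen, spent = select_for_crowd(
    {"r1.city": (0.9, 2.0), "r2.dob": (0.4, 1.0), "r3.zip": (0.2, 2.0)},
    budget=3.0)
```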
Funding: Supported by the Ministry of Knowledge Economy, Korea, under the Information Technology Research Center support program supervised by the National IT Industry Promotion Agency (No. NIPA-2011-C1090-1111-0008), the Special Research Program of Chonnam National University, 2009, and the LG Yonam Culture Foundation.
Abstract: Missing values occur in bio-signal processing for various reasons, including technical problems and biological characteristics. These missing values are either simply excluded or substituted with estimated values for further processing. When missing signal values are estimated for electroencephalography (EEG) signals, where electrical signals arrive quickly and successively, rapid processing of high-speed data is required for immediate decision making. In this study, we propose an incremental expectation maximization principal component analysis (iEMPCA) method that automatically estimates missing values from multivariate EEG time series data without requiring a whole and complete dataset. The proposed method solves the problem of a biased model, which inevitably results from simply removing incomplete data rather than estimating it, and thus reduces information loss by incorporating missing values in real time. By using an incremental approach, the proposed method also minimizes the memory usage and processing time for continuously arriving data. Experimental results show that the proposed method assigns more accurate missing values than previous methods.
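A batch (non-incremental) EM-PCA imputation loop can be sketched as follows; the paper's iEMPCA additionally updates the model incrementally as EEG samples stream in, which this offline sketch omits:

```python
import numpy as np

def em_pca_impute(X, n_components=1, n_iter=50):
    """Batch EM-PCA imputation sketch: alternate between fitting a low-rank
    PCA model on the current completed matrix (M-step) and re-estimating the
    missing entries from the model's reconstruction (E-step)."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    # Initialise missing entries with the column means.
    filled = np.where(mask, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        C = filled - mu
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        recon = mu + (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]
        filled[mask] = recon[mask]   # only missing entries are updated
    return filled

X = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.5])  # rank-1 toy "signal"
X[1, 1] = np.nan                                 # true value is 1.0
filled = em_pca_impute(X, n_components=1, n_iter=50)
```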
Abstract: Multivariate time series with missing values are common in a wide range of applications, including energy data. Existing imputation methods often fail to address the temporal dynamics and the cross-dimensional correlation simultaneously. In this paper, we propose a two-step method based on an attention model to impute missing values in multivariate energy time series. First, the underlying distribution of the missing values in the data is learned; this information is then used to train an attention-based imputation model. By learning the distribution prior to the imputation process, the model can respond flexibly to the specific characteristics of the underlying data. The developed model is applied to European energy data obtained from the European Network of Transmission System Operators for Electricity. Using different evaluation metrics and benchmarks, the conducted experiments show that the proposed model is preferable to the benchmarks and is able to accurately impute missing values.
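As a toy illustration of attention-weighted imputation (much reduced from the paper's trained two-step model), each incomplete time step can attend to the fully observed steps, with scores derived from similarity on the jointly observed dimensions:

```python
import numpy as np

def attention_impute(X, temp=1.0):
    """Toy attention imputation: each time step with missing entries attends
    to fully observed steps; softmax over negative squared distances on the
    observed dimensions gives the attention weights, and missing entries are
    filled with the attention-weighted average.  Assumes at least one fully
    observed time step exists."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    complete = np.where(~mask.any(axis=1))[0]
    out = X.copy()
    for t in np.where(mask.any(axis=1))[0]:
        obs = np.where(~mask[t])[0]
        mis = np.where(mask[t])[0]
        # Negative squared distance on observed dimensions as attention score.
        scores = -((X[np.ix_(complete, obs)] - X[t, obs]) ** 2).sum(axis=1) / temp
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[t, mis] = w @ X[np.ix_(complete, mis)]
    return out

# Rows = time steps, columns = dimensions; row 2 has a missing second dimension.
X = np.array([[1.0, 10.0], [2.0, 20.0], [1.0, np.nan]])
out = attention_impute(X)
```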
Abstract: This paper investigates the characteristics of a clinical dataset using a combination of feature selection and classification methods to handle missing values and to understand the underlying statistical characteristics of a typical clinical dataset. A large clinical dataset typically presents challenges such as missing values, high dimensionality, and unbalanced classes, which pose inherent problems when implementing feature selection and classification algorithms. With most clinical datasets, an initial exploration is carried out, and attributes with more than a certain percentage of missing values are eliminated. Later, with the help of missing value imputation, feature selection, and classification algorithms, prognostic and diagnostic models are developed. This paper draws two main conclusions: 1) Despite the nature and large size of clinical datasets, the choice of missing value imputation method does not affect final performance; what is crucial is that the dataset accurately represents the clinical problem, so the imputation method is not critical for developing classifiers and prognostic/diagnostic models. 2) Supervised learning has proven more suitable for mining clinical data than unsupervised methods; it is also shown that non-parametric classifiers such as decision trees give better results than parametric classifiers such as radial basis function networks (RBFNs).
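The drop-then-impute workflow described above can be sketched with pandas; the 40% threshold, the median imputation, and the column names are illustrative choices, not the paper's exact settings:

```python
import pandas as pd

def clean_clinical(df, max_missing=0.4):
    """Typical clinical-data workflow: drop attributes whose missing-value
    ratio exceeds a threshold, then impute the rest (median here -- the
    paper's point is that the exact imputation method matters little)."""
    keep = df.columns[df.isna().mean() <= max_missing]
    out = df[keep].copy()
    return out.fillna(out.median(numeric_only=True))

df = pd.DataFrame({"age": [60, None, 70, 80],                  # 25% missing: kept
                   "biomarker": [None, None, None, 1.2],       # 75% missing: dropped
                   "bp": [120, 130, None, 140]})               # 25% missing: kept
clean = clean_clinical(df)
```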