Funding: Supported by the Researchers Supporting Project number (RSP2024R 34), King Saud University, Riyadh, Saudi Arabia.
Abstract: In numerous real-world healthcare applications, handling incomplete medical data poses significant challenges for missing value imputation and subsequent clustering or classification tasks. Traditional approaches often rely on statistical methods for imputation, which may yield suboptimal results and be computationally intensive. This paper aims to integrate imputation and clustering techniques to enhance the classification of incomplete medical data with improved accuracy. Conventional classification methods are ill-suited for incomplete medical data. To enhance efficiency without compromising accuracy, this paper introduces a novel approach that combines imputation and clustering for the classification of incomplete data. Initially, the linear interpolation imputation method is applied alongside an iterative fuzzy c-means clustering method, followed by a classification algorithm. The effectiveness of the proposed approach is evaluated using multiple performance metrics, including accuracy, precision, specificity, and sensitivity. The encouraging results demonstrate that our proposed method surpasses classical approaches across various performance criteria.
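As an illustration of the imputation-plus-clustering stage described in this abstract, the following minimal Python sketch applies linear interpolation imputation and then fuzzy c-means; the column names, cluster count, and fuzzifier value are placeholders rather than the paper's settings:

import numpy as np
import pandas as pd

def fuzzy_c_means(X, n_clusters=2, m=2.0, n_iter=100, seed=0):
    """Plain NumPy fuzzy c-means; returns cluster centers and the membership matrix."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)                 # memberships sum to 1 per sample
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = dist ** (-2.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

# Hypothetical incomplete medical records (NaN marks missing values).
df = pd.DataFrame({"glucose": [5.1, np.nan, 6.3, 7.0, 6.1],
                   "pressure": [80.0, 85.0, np.nan, 95.0, 88.0]})

imputed = df.interpolate(method="linear", limit_direction="both")   # step 1: linear interpolation
centers, memberships = fuzzy_c_means(imputed.to_numpy())            # step 2: fuzzy c-means
labels = memberships.argmax(axis=1)   # hard labels that a downstream classifier could use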
Funding: Funded by the National Natural Science Foundation of China under Grant 62273022.
Abstract: Due to the high inherent uncertainty of renewable energy, probabilistic day-ahead wind power forecasting is crucial for modeling and controlling the uncertainty of renewable energy smart grids in smart cities. However, the accuracy and reliability of high-resolution day-ahead wind power forecasting are constrained by unreliable local weather prediction and incomplete power generation data. This article proposes a physics-informed artificial intelligence (AI) surrogate method to augment the incomplete dataset and quantify its uncertainty in order to improve wind power forecasting performance. The incomplete dataset, built from numerical weather prediction data, historical wind power generation, and weather factor data, is augmented based on generative adversarial networks. After augmentation, the enriched data is fed into a multiple-AI-surrogates model constructed from two extreme learning machine networks to train the forecasting model for wind power. The forecasting models' accuracy and generalization ability are thereby improved by mining the implicit physics information in the incomplete dataset. An incomplete dataset gathered from a wind farm in North China, containing only 15 days of weather and wind power generation data with missing points caused by occasional shutdowns, is utilized to verify the proposed method's performance. Compared with other probabilistic forecasting methods, the proposed method shows better accuracy and probabilistic performance on the same incomplete dataset, which highlights its potential for more flexible and sensitive maintenance of smart grids in smart cities.
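The extreme learning machine networks mentioned in this abstract can be sketched in a few lines of Python: hidden-layer weights are drawn at random and only the output weights are solved in closed form. This is a generic single ELM regressor on placeholder data, not the authors' physics-informed two-network surrogate:

import numpy as np

class ELMRegressor:
    """Minimal extreme learning machine: random hidden layer + least-squares output layer."""
    def __init__(self, n_hidden=80, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        self.beta, *_ = np.linalg.lstsq(H, y, rcond=None)    # closed-form output weights
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

# Placeholder data: six weather features -> wind power output.
rng = np.random.default_rng(1)
X, y = rng.random((200, 6)), rng.random(200)
model = ELMRegressor().fit(X[:150], y[:150])
forecast = model.predict(X[150:])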
Abstract: With the rapid development of the economy, the scale of the power grid is expanding. The amount of power equipment that constitutes the power grid has become very large, which makes the state data of power equipment grow explosively. These multi-source heterogeneous data have data differences, which lead to data variation during transmission and preservation, thus forming the bad information of incomplete data. Therefore, research on data integrity has become an urgent task. This paper is based on the characteristics of random chance and the spatio-temporal differences of the system. According to the characteristics and data sources of the massive data generated by power equipment, a fuzzy mining model of power equipment data is established, and the data is divided into numerical and non-numerical data; the text data of power equipment defects is taken as the mining material. Then, an array-based Apriori algorithm is used to mine deeply, and the strong association rules in the incomplete data of power equipment are obtained and analyzed. From the change trend of the NRMSE metric and the classification accuracy, most of the filling methods combined with the two frameworks in this method usually show a relatively stable filling trend and do not fluctuate greatly as the missing rate grows. The experimental results show that the proposed algorithm model can effectively improve the filling effect of the existing filling methods on most data sets, and that this effect varies with the missing rate; that is, as the missing rate increases, the improvement of the model over the existing filling methods is higher than 4.3%. Through the incomplete data clustering technology studied in this paper, a more innovative state assessment of smart grid reliability operation is carried out, which has good research value and reference significance.
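A compact sketch of Apriori-style association rule mining over defect records, in Python; the transactions and the minimum support threshold are invented for illustration, and the candidate generation is simplified (no pruning step):

def apriori(transactions, min_support=0.5):
    """Return frequent itemsets (frozensets) whose support is at least min_support."""
    sets = [set(t) for t in transactions]
    n = len(sets)
    candidates = {frozenset([item]) for t in sets for item in t}   # 1-itemsets
    frequent = {}
    while candidates:
        counts = {c: sum(c <= t for t in sets) / n for c in candidates}
        survivors = {c for c, s in counts.items() if s >= min_support}
        frequent.update({c: counts[c] for c in survivors})
        # join step: build (k+1)-itemset candidates from the surviving k-itemsets
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == len(a) + 1}
    return frequent

# Hypothetical defect records for power equipment.
records = [["overheating", "oil leak"], ["overheating", "partial discharge"],
           ["overheating", "oil leak", "partial discharge"], ["oil leak"]]
print(apriori(records, min_support=0.5))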
Abstract: A generalized flexibility-based objective function for structural damage identification is constructed for solving the constrained nonlinear least-squares optimization problem. To begin with, the generalized flexibility matrix (GFM) proposed to solve the damage identification problem is recalled and a modal expansion method is introduced. Next, the objective function for the iterative optimization process based on the GFM is formulated, and the Trust-Region algorithm is utilized to obtain the solution of the optimization problem for multiple damage cases. Then, to compute the objective function gradient, the sensitivity analysis with respect to the design variables is derived. In addition, due to spatial incompleteness, the influence of stiffness reduction and incomplete modal measurement data is discussed by means of two numerical examples with several damage cases. Finally, based on the computational results, it is evident that the presented approach provides good validity and reliability for large and complicated engineering structures.
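The constrained nonlinear least-squares step can be illustrated with SciPy's trust-region reflective solver: the residual compares the generalized flexibility matrix of a damaged model with a "measured" one as a function of elemental stiffness-reduction factors. The 4-DOF spring chain, damage pattern, and mode count below are invented placeholders, not the paper's structure:

import numpy as np
from scipy.linalg import eigh
from scipy.optimize import least_squares

def chain_stiffness(alphas, k0=1.0e3):
    """Stiffness matrix of a hypothetical spring chain; alphas are per-element damage factors."""
    n = len(alphas)
    K = np.zeros((n, n))
    for e, a in enumerate(alphas):
        ke = (1.0 - a) * k0
        K[e, e] += ke
        if e > 0:
            K[e - 1, e - 1] += ke
            K[e - 1, e] -= ke
            K[e, e - 1] -= ke
    return K

def generalized_flexibility(K, M, n_modes=3):
    """GFM from the lowest modes: Phi * Lambda^-2 * Phi^T with mass-normalized mode shapes."""
    lam, phi = eigh(K, M)                       # generalized eigenproblem, phi.T @ M @ phi = I
    phi, lam = phi[:, :n_modes], lam[:n_modes]
    return phi @ np.diag(1.0 / lam ** 2) @ phi.T

M = np.eye(4)
true_damage = np.array([0.0, 0.3, 0.0, 0.0])             # assumed 30% stiffness loss in element 2
Fg_measured = generalized_flexibility(chain_stiffness(true_damage), M)

def residual(alphas):
    return (generalized_flexibility(chain_stiffness(alphas), M) - Fg_measured).ravel()

sol = least_squares(residual, x0=np.zeros(4), bounds=(0.0, 0.99), method="trf")
print(sol.x.round(3))                                     # estimated damage factors for the toy chain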
Funding: Supported by the National Natural Science Foundation of China (61202473), the Fundamental Research Funds for Central Universities (JUSRP111A49), the "111 Project" (B12018), and the Priority Academic Program Development of Jiangsu Higher Education Institutions.
Abstract: For the fault detection and diagnosis problem in large-scale industrial systems, there are two important issues: missing data samples and the non-Gaussian property of the data. However, most of the existing data-driven methods cannot handle both of them. Thus, a new Bayesian network classifier based fault detection and diagnosis method is proposed. At first, a non-imputation method is presented to handle the incomplete data samples; owing to the properties of the proposed Bayesian network classifier, the missing values can be marginalized in an elegant manner. Furthermore, the Gaussian mixture model is used to approximate the non-Gaussian data with a linear combination of finite Gaussian mixtures, so that the Bayesian network can process the non-Gaussian data in an effective way. Therefore, the entire fault detection and diagnosis method can deal with high-dimensional incomplete process samples in an efficient and robust way. The diagnosis results are expressed in the manner of probability with reliability scores. The proposed approach is evaluated with a benchmark problem called the Tennessee Eastman process. The simulation results show the effectiveness and robustness of the proposed method in fault detection and diagnosis for large-scale systems with missing measurements.
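The non-imputation idea, marginalizing missing values rather than filling them in, can be shown with one Gaussian per class: each sample is scored only on its observed dimensions, which is exactly the marginal of the class-conditional Gaussian. This is a simplified stand-in for the paper's Bayesian network classifier with Gaussian mixtures, on made-up data:

import numpy as np
from scipy.stats import multivariate_normal

def fit_class_gaussians(X, y):
    """Per-class mean, covariance, and prior estimated from complete training data."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])   # slight regularization
        params[c] = (Xc.mean(axis=0), cov, len(Xc) / len(X))
    return params

def classify_incomplete(x, params):
    """Posterior class probabilities using only observed entries of x (NaN = missing)."""
    obs = ~np.isnan(x)
    scores = {}
    for c, (mu, cov, prior) in params.items():
        # Marginal of a multivariate Gaussian: keep observed rows/columns only.
        scores[c] = prior * multivariate_normal.pdf(x[obs], mean=mu[obs], cov=cov[np.ix_(obs, obs)])
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical process data: two operating conditions, three measured variables.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 3)), rng.normal(3.0, 1.0, (100, 3))])
y = np.repeat([0, 1], 100)
print(classify_incomplete(np.array([2.8, np.nan, 3.1]), fit_class_gaussians(X, y)))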
Funding: Supported by the Sustentation Program of National Ministries and Commissions of China (Grant No. 203020102).
Abstract: Data obtained from accelerated life testing (ALT) when there are two or more failure modes, which is commonly referred to as competing failure modes, are often incomplete. The incompleteness is mainly due to censoring, as well as masking, which might be the case that the failure time is observed but its corresponding failure mode is not identified, because identification of the failure mode may be expensive or very difficult to investigate owing to a lack of appropriate diagnostics. A method is proposed for analyzing incomplete data of constant-stress ALT with competing failure modes. It is assumed that the failure modes have s-independent latent lifetimes and that the log lifetime of each failure mode can be written as a linear function of stress. The parameters of the model are estimated by using the expectation maximization (EM) algorithm with incomplete data. Simulation studies are performed to check model validity and investigate the properties of the estimates. For further validation, the method is also illustrated by an example, which shows the process of analyzing incomplete data from ALT of an insulation system. Because the incompleteness of the data is considered in modeling and the EM algorithm is used in estimation, the method becomes more flexible in ALT analysis.
Funding: Supported in part by the National Natural Science Foundation of China (51975075) and the Chongqing Technology Innovation and Application Program (cstc2018jszx-cyzd X0183).
Abstract: Energy consumption prediction of a CNC machining process is important for energy efficiency optimization strategies. To improve the generalization abilities, more and more parameters are acquired for energy prediction modeling. However, the data collected from workshops may be incomplete because of misoperation, unstable network connections, frequent transfers, etc. This work proposes a framework for energy modeling based on incomplete data to address this issue. First, some necessary preliminary operations are applied to the incomplete data sets. Then, missing values are estimated to generate a new complete data set based on generative adversarial imputation nets (GAIN). Next, the gene expression programming (GEP) algorithm is utilized to train the energy model based on the generated data sets. Finally, the predictive accuracy of the obtained model is tested. Computational experiments are designed to investigate the performance of the proposed framework with different rates of missing data. Experimental results demonstrate that even when the missing data rate increases to 30%, the proposed framework can still make efficient predictions, with the corresponding RMSE and MAE being 0.903 kJ and 0.739 kJ, respectively.
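The evaluation protocol described here (inject missing values at a given rate, impute, fit an energy model, report RMSE and MAE) can be sketched as follows; GAIN and GEP themselves are not reproduced, so scikit-learn's iterative imputer and a gradient-boosting regressor serve as stand-ins, and the data are synthetic:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

rng = np.random.default_rng(0)
X = rng.random((500, 5))                                                   # synthetic machining parameters
y = X @ np.array([2.0, -1.0, 0.5, 3.0, 1.5]) + rng.normal(0, 0.1, 500)    # synthetic "energy" target

for missing_rate in (0.1, 0.2, 0.3):
    X_miss = X.copy()
    X_miss[rng.random(X.shape) < missing_rate] = np.nan                   # randomly drop entries
    X_filled = IterativeImputer(random_state=0).fit_transform(X_miss)
    model = GradientBoostingRegressor(random_state=0).fit(X_filled[:400], y[:400])
    pred = model.predict(X_filled[400:])
    rmse = np.sqrt(mean_squared_error(y[400:], pred))
    mae = mean_absolute_error(y[400:], pred)
    print(f"missing {missing_rate:.0%}: RMSE={rmse:.3f}, MAE={mae:.3f}")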
Funding: Supported by the National Natural Science Foundation of China (61433001) and the Tsinghua University Initiative Scientific Research Program.
Abstract: In modern industrial processes, timely detection and diagnosis of process abnormalities are critical for monitoring process operations. Various fault detection and diagnosis (FDD) methods have been proposed and implemented, the performance of which, however, could be drastically influenced by the common presence of incomplete or missing data in real industrial scenarios. This paper presents a new FDD approach based on an incomplete data imputation technique for process fault recognition. It employs the modified stacked autoencoder, a deep learning structure, in the phase of incomplete data treatment, and classifies data representations rather than the imputed complete data in the phase of fault identification. A benchmark process, the Tennessee Eastman process, is employed to illustrate the effectiveness and applicability of the proposed method.
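The notion of classifying learned representations rather than imputed raw data can be illustrated with a small reconstruction network used as a stand-in for the modified stacked autoencoder; the missing-value treatment here is a crude zero fill, and every size and dataset below is invented:

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                             # synthetic process measurements
y = (X[:, 0] + X[:, 3] > 0).astype(int)                    # synthetic fault label
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.1] = 0.0                    # crude zero fill for missing entries

# Stand-in autoencoder: reconstruct the inputs through a narrow hidden layer.
ae = MLPRegressor(hidden_layer_sizes=(4,), activation="relu",
                  max_iter=2000, random_state=0).fit(X_miss, X_miss)

# Hidden representation = first-layer activations, computed by hand from the fitted weights.
H = np.maximum(0.0, X_miss @ ae.coefs_[0] + ae.intercepts_[0])

# Classify the representations instead of the (imputed) raw data.
clf = LogisticRegression().fit(H[:200], y[:200])
print("held-out accuracy:", clf.score(H[200:], y[200:]))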
Funding: Supported by the National Natural Science Foundation of China (51775090).
Abstract: Due to the simplicity and flexibility of the power law process, it is widely used to model the failures of repairable systems. Although statistical inference on the parameters of the power law process has been well developed, numerous studies largely depend on complete failure data. A few methods are reported to process incomplete data, but they are limited to specific cases, especially the case where missing data occur at the early stage of the failures, and no framework to handle generic scenarios is available. To overcome this problem, the statistical inference of the power law process with incomplete data is established in this paper from the point of view of order statistics. The theoretical derivation is carried out, and case studies demonstrate and verify the proposed method. Order statistics offer an alternative to the statistical inference of the power law process with incomplete data, as they can reformulate current studies on left-censored failure data and interval-censored data in a unified framework. The results show that the proposed method has more flexibility and wider applicability.
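For comparison with the incomplete-data treatment above, the complete-data case has a familiar closed form: for a failure-truncated power law process with intensity lambda(t) = alpha * beta * t**(beta - 1), the maximum-likelihood estimates are computed below; the failure ages are invented:

import numpy as np

def plp_mle(failure_times):
    """Closed-form MLE for a failure-truncated power law process (complete data)."""
    t = np.sort(np.asarray(failure_times, dtype=float))
    n, T = len(t), t[-1]                          # observation ends at the last failure
    beta = n / np.sum(np.log(T / t[:-1]))         # shape parameter
    alpha = n / T ** beta                         # scale parameter, E[N(T)] = alpha * T**beta
    return alpha, beta

ages = [45.0, 110.0, 190.0, 280.0, 410.0, 560.0, 750.0]   # hypothetical failure ages (hours)
alpha, beta = plp_mle(ages)
print(f"alpha = {alpha:.4g}, beta = {beta:.3f}")           # beta < 1 indicates reliability growth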
Funding: Funded by the National Natural Science Foundation of China under Grant Nos. 61972057 and U1836208, the Hunan Provincial Natural Science Foundation of China under Grant No. 2019JJ50655, the Scientific Research Foundation of the Hunan Provincial Education Department of China under Grant No. 18B160, the Open Fund of the Hunan Key Laboratory of Smart Roadway and Cooperative Vehicle Infrastructure Systems (Changsha University of Science and Technology) under Grant No. kfj180402, the "Double First-class" International Cooperation and Development Scientific Research Project of Changsha University of Science and Technology under Grant No. 2018IC25, and the Researchers Supporting Project No. RSP-2020/102, King Saud University, Riyadh, Saudi Arabia.
Abstract: Multiple kernel clustering is an unsupervised data analysis method that has been used in various scenarios where data is easy to collect but hard to label. However, multiple kernel clustering for incomplete data is a critical yet challenging task. Although the existing absent multiple kernel clustering methods have achieved remarkable performance on this task, they may fail when the data has a high value-missing rate, and they may easily fall into a local optimum. To address these problems, in this paper, we propose an absent multiple kernel clustering (AMKC) method for incomplete data. The AMKC method first clusters the initialized incomplete data. Then, it constructs a new multiple-kernel-based data space, referred to as K-space, from multiple sources to learn kernel combination coefficients. Finally, it seamlessly integrates an incomplete-kernel-imputation objective, a multiple-kernel-learning objective, and a kernel-clustering objective in order to achieve absent multiple kernel clustering. The three stages in this process are carried out simultaneously until the convergence condition is met. Experiments on six datasets with various characteristics demonstrate that the kernel imputation and clustering performance of the proposed method is significantly better than that of state-of-the-art competitors. Meanwhile, the proposed method achieves fast convergence speed.
Abstract: The two-parameter exponential distribution can often be used to describe the lifetime of products, for example, electronic components, engines, and so on. This paper considers a prediction problem arising in the life test of key parts in high-speed trains. Employing the Bayes method, a joint prior is used to describe the variability of the parameters, but the form of the prior is not specified and only several moment conditions are assumed. Under the condition that the observed samples are randomly right censored, we define a statistic to predict a set of future samples that describes the average life of the second-round samples, first under the condition that the censoring distribution is known and then under the condition that it is unknown. For several different priors and life data sets, we demonstrate the coverage frequencies of the proposed prediction intervals as the sample size of the observed data and the censoring proportion change. The numerical results show that the prediction intervals are efficient and applicable.
Funding: Supported in part by the Center-initiated Research Project and Research Initiation Project of Zhejiang Laboratory (113012-AL2201, 113012-PI2103), the National Natural Science Foundation of China (61300167, 61976120), the Natural Science Foundation of Jiangsu Province (BK20191445), the Natural Science Key Foundation of the Jiangsu Education Department (21KJA510004), and the Qing Lan Project of Jiangsu Province.
Abstract: Data with missing values, or incomplete information, brings some challenges to the development of classification, as the incompleteness may significantly affect the performance of classifiers. In this paper, we handle missing values in both training and test sets with uncertainty and imprecision reasoning by proposing a new belief combination of classifiers (BCC) method based on evidence theory. The proposed BCC method aims to improve the classification performance on incomplete data by characterizing the uncertainty and imprecision brought by incompleteness. In BCC, different attributes are regarded as independent sources, and the collection of each attribute is considered as a subset. Then, multiple classifiers are trained with each subset independently, allowing each observed attribute to provide a sub-classification result for the query pattern. Finally, these sub-classification results with different weights (discounting factors) are used to provide supplementary information to jointly determine the final classes of query patterns. The weights consist of two aspects: global and local. The global weight, calculated by an optimization function, is employed to represent the reliability of each classifier, and the local weight, obtained by mining attribute distribution characteristics, is used to quantify the importance of observed attributes to the pattern classification. Abundant comparative experiments including seven methods on twelve datasets are executed, demonstrating the outperformance of BCC over all baseline methods in terms of accuracy, precision, recall, and F1 measure, with pertinent computational costs.
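The central mechanism of BCC, one classifier per attribute whose discounted outputs are fused only when that attribute is observed in the query, can be sketched with scikit-learn; the weighted averaging below simplifies the evidence-theoretic combination, and the data, weights, and classifier choice are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)               # synthetic two-class problem

# Train one classifier per attribute (each attribute acts as an independent source).
per_attr = [DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[:, [j]], y)
            for j in range(X.shape[1])]
global_w = np.array([0.4, 0.1, 0.35, 0.15])                 # assumed per-classifier reliabilities

def combine(x):
    """Fuse sub-classification results from observed attributes with discounting weights."""
    fused, total = np.zeros(2), 0.0
    for j, clf in enumerate(per_attr):
        if np.isnan(x[j]):                                   # a missing attribute contributes nothing
            continue
        fused += global_w[j] * clf.predict_proba([[x[j]]])[0]
        total += global_w[j]
    return fused / total                                     # normalized class support

query = np.array([0.8, np.nan, -0.2, np.nan])                # query pattern with missing values
print(combine(query))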
Funding: Supported by the National Natural Science Foundation of China (No. 61273302) and partially by the Natural Science Foundation of Anhui Province (No. 1208085MF98, 1208085MF94).
Abstract: The data used in the process of knowledge discovery often include noise and incomplete information, and the boundaries of different classes of these data are blurred and unobvious. When these data are clustered or classified, we often get coverings instead of partitions, which usually makes our information system insecure. In this paper, the optimal partitioning of incomplete data is researched. Firstly, the relationship between set cover and set partition is discussed, and the distance between a set cover and a set partition is defined. Secondly, the optimal partitioning of a given cover is researched by the combining and parting method, and acquiring the optimal partition from three different partition set families is discussed. Finally, the corresponding optimal algorithm is given. Real wireless signals often contain a lot of noise, and there are many errors at the boundaries when these data are clustered with the traditional method. In our experiment, the proposed method improves the correct rate greatly, and the experimental results demonstrate the method's validity.
Abstract: In this paper, we discuss the theoretical validity of the observed partial likelihood (OPL) constructed in a Cox-type model under incomplete data with two class possibilities, such as missing binary covariates, a cure-mixture model, or doubly censored data. The main result establishes the asymptotic convergence of the OPL. To reach this result, as it is difficult to apply some standard tools of survival analysis, we develop tools for weak convergence based on partial-sum processes. The result of the asymptotic convergence shown here indicates that a suitable order for the number of Monte Carlo trials is less than the square of the sample size. In addition, using numerical examples, we investigate how the asymptotic properties discussed here behave in a finite sample.
Abstract: Many real-life data sets are incomplete, or, in other words, are affected by missing attribute values. Three interpretations of missing attribute values are discussed in the paper: lost values (erased values), attribute-concept values (such a value may be replaced by any value from the attribute domain restricted to the concept), and "do not care" conditions (a missing attribute value may be replaced by any value from the attribute domain). For incomplete data sets, three definitions of lower and upper approximations are discussed. Experiments were conducted on six typical data sets with missing attribute values, using three different interpretations of missing attribute values and the same definition of concept lower and upper approximations. The conclusion is that the best approach to missing attribute values is the lost value type.
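As a small illustration of the "do not care" interpretation, the sketch below computes characteristic sets (a missing value, written '*', matches any value) and singleton lower and upper approximations of a concept on a tiny made-up decision table; the lost-value and attribute-concept interpretations would only change the matching rule:

def characteristic_set(table, i, missing="*"):
    """Cases indistinguishable from case i when '*' (do not care) matches any value."""
    row = table[i]
    return {j for j, other in enumerate(table)
            if all(a == missing or b == missing or a == b for a, b in zip(row, other))}

def approximations(table, concept, missing="*"):
    """Singleton lower and upper approximations of a concept (a set of case indices)."""
    K = [characteristic_set(table, i, missing) for i in range(len(table))]
    lower = {i for i in range(len(table)) if K[i] <= concept}
    upper = {i for i in range(len(table)) if K[i] & concept}
    return lower, upper

# Tiny decision table: attribute values per case, '*' marking missing entries.
table = [("high", "yes"), ("high", "*"), ("low", "no"), ("*", "no")]
concept = {0, 1}                          # cases belonging to the target concept
print(approximations(table, concept))     # lower approximation {0}, upper approximation {0, 1, 3}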