The estimation of covariance matrices is very important in many fields, such as statistics. In real applications, data are frequently affected by high dimensionality and noise. However, most relevant studies are based on complete data. This paper studies the optimal estimation of high-dimensional covariance matrices based on missing and noisy samples under the norm. First, the model with sub-Gaussian additive noise is presented. The generalized sample covariance is then modified to define a hard thresholding estimator, and the minimax upper bound is derived. After that, the minimax lower bound is derived, and it is concluded that the estimator presented in this article is rate-optimal. Finally, numerical simulation analysis is performed. The results show that for missing samples with sub-Gaussian noise, if the true covariance matrix is sparse, the hard thresholding estimator outperforms the traditional estimation method.
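As a rough illustration of the hard thresholding idea in the abstract above, the following Python sketch computes a covariance matrix from rows with missing entries and then zeroes small entries. It is not the paper's estimator: the pairwise-complete averaging, the threshold `lam`, the choice to keep the diagonal, and both function names are assumptions made for illustration, and no correction for the additive noise is attempted.

```python
def pairwise_covariance(rows):
    """Covariance from rows that may contain None (missing) entries.

    Entry (j, k) averages only over samples where both coordinates
    are observed (an assumed stand-in for a generalized sample covariance).
    """
    p = len(rows[0])
    # Per-coordinate mean over the observed values only.
    means = []
    for j in range(p):
        seen = [r[j] for r in rows if r[j] is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    cov = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(p):
            pairs = [(r[j], r[k]) for r in rows
                     if r[j] is not None and r[k] is not None]
            if pairs:
                cov[j][k] = sum((x - means[j]) * (y - means[k])
                                for x, y in pairs) / len(pairs)
    return cov


def hard_threshold(cov, lam):
    """Zero every off-diagonal entry whose magnitude is below lam."""
    p = len(cov)
    return [[cov[j][k] if j == k or abs(cov[j][k]) >= lam else 0.0
             for k in range(p)] for j in range(p)]


rows = [[1.0, 2.0], [3.0, 6.0], [None, 4.0]]  # None marks a missing value
cov = pairwise_covariance(rows)
sparse_cov = hard_threshold(cov, lam=0.1)
```

Hard thresholding keeps large entries unchanged and sets small ones exactly to zero, which is why it suits a sparse true covariance matrix; how the threshold should scale with the sample size and the missingness rate is the subject of the paper and is not reproduced here.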
As a crucial data preprocessing method in data mining, feature selection (FS) can be regarded as a bi-objective optimization problem that aims to maximize classification accuracy and minimize the number of selected features. Evolutionary computing (EC) is promising for FS owing to its powerful search capability. However, in traditional EC-based methods, feature subsets are represented via a length-fixed individual encoding. It is ineffective for high-dimensional data, because it results in a huge search space and prohibitive training time. This work proposes a length-adaptive non-dominated sorting genetic algorithm (LA-NSGA) with a length-variable individual encoding and a length-adaptive evolution mechanism for bi-objective high-dimensional FS. In LA-NSGA, an initialization method based on correlation and redundancy is devised to initialize individuals of diverse lengths, and a Pareto dominance-based length change operator is introduced to guide individuals to explore promising search space adaptively. Moreover, a dominance-based local search method is employed for further improvement. The experimental results based on 12 high-dimensional gene datasets show that the Pareto front of feature subsets produced by LA-NSGA is superior to those of existing algorithms.
The performance of conventional similarity measurement methods is seriously affected by the curse of dimensionality of high-dimensional data. The reason is that the data difference between sparse and noisy dimensions occupies a large proportion of the similarity, which makes any two results appear dissimilar. A similarity measurement method for high-dimensional data based on a normalized net lattice subspace is proposed. The data range of each dimension is divided into several intervals, and the components in different dimensions are mapped onto the corresponding interval. Only components in the same or adjacent intervals are used to calculate the similarity. To validate this method, three data types are used, and seven common similarity measurement methods are compared. The experimental results indicate that the relative difference of the method increases with the dimensionality and is approximately two or three orders of magnitude higher than that of the conventional methods. In addition, the similarity range of this method in different dimensions is [0,1], which suits similarity analysis after dimensionality reduction.
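The interval mapping described in the abstract above can be sketched as follows. The abstract does not give the actual scoring formula, so the per-dimension score `1 - |x - y|`, the division by the full dimension count, the default of five intervals, and the function name `lattice_similarity` are all assumptions chosen only so that the result stays in [0, 1].

```python
def lattice_similarity(a, b, intervals=5):
    """Similarity in [0, 1] for two vectors already normalized to [0, 1].

    Each dimension is split into `intervals` equal bins; a dimension
    contributes only when both components fall in the same or an
    adjacent bin (hypothetical scoring, not the paper's formula).
    """
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    total = 0.0
    for x, y in zip(a, b):
        # Map each component to its bin index; clamp 1.0 into the last bin.
        ix = min(int(x * intervals), intervals - 1)
        iy = min(int(y * intervals), intervals - 1)
        if abs(ix - iy) <= 1:
            total += 1.0 - abs(x - y)
    return total / len(a)


a = [0.1, 0.5, 0.9]
b = [0.15, 0.55, 0.1]
score = lattice_similarity(a, b)  # the third dimension is ignored
```

Dimensions whose components land in distant bins contribute nothing, so a few noisy coordinates cannot dominate the score the way they do in a plain Euclidean distance.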
In this paper, an Observation Points Classifier Ensemble (OPCE) algorithm is proposed to deal with High-Dimensional Imbalanced Classification (HDIC) problems based on data processed using the Multi-Dimensional Scaling (MDS) feature extraction technique. First, the dimensionality of the original imbalanced data is reduced using MDS so that distances between any two different samples are preserved as well as possible. Second, a novel OPCE algorithm is applied to classify imbalanced samples by placing optimised observation points in a low-dimensional data space. Third, optimization of the observation point mappings is carried out to obtain a reliable assessment of the unknown samples. Exhaustive experiments have been conducted to evaluate the feasibility, rationality, and effectiveness of the proposed OPCE algorithm using seven benchmark HDIC data sets. Experimental results show that (1) the OPCE algorithm can be trained faster on low-dimensional imbalanced data than on high-dimensional data; (2) the OPCE algorithm can correctly identify samples as the number of optimised observation points is increased; and (3) statistical analysis reveals that OPCE yields better HDIC performance on the selected data sets in comparison with eight other HDIC algorithms. This demonstrates that OPCE is a viable algorithm to deal with HDIC problems.
Viticulturists traditionally have a keen interest in studying the relationship between the biochemistry of grapevines’ leaves/petioles and their associated spectral reflectance in order to understand the fruit ripening rate, water status, nutrient levels, and disease risk. In this paper, we use imaging spectroscopy (hyperspectral) reflectance data, for the reflective 330-2510 nm wavelength region (986 total spectral bands), to assess vineyard nutrient status; this constitutes a high-dimensional dataset with an ill-conditioned covariance matrix. The identification of the variables (wavelength bands) that contribute useful information for nutrient assessment and prediction plays a pivotal role in multivariate statistical modeling. In recent years, researchers have successfully developed many continuous, nearly unbiased, sparse and accurate variable selection methods to overcome this problem. This paper compares four regularized regression methods and one functional regression method for wavelength variable selection: Elastic Net, Multi-Step Adaptive Elastic Net, Minimax Concave Penalty, iterative Sure Independence Screening, and Functional Data Analysis. Thereafter, the predictive performance of these regularized sparse models is enhanced using stepwise regression. This comparative study of regression methods using a high-dimensional and highly correlated grapevine hyperspectral dataset revealed that variable selection with Elastic Net yields the best predictive ability.
Making accurate forecasts or predictions is a challenging task in the big data era, in particular for those datasets involving high-dimensional variables but short-term time series points, which are generally available from real-world systems. To address this issue, Prof.
Multi-area combined economic/emission dispatch (MACEED) problems are generally studied using analytical functions. However, as the scale of power systems increases, existing solutions become time-consuming and may not meet operational constraints. To overcome excessive computational expense in high-dimensional MACEED problems, a novel data-driven surrogate-assisted method is proposed. First, a cosine-similarity-based deep belief network combined with a back-propagation (DBN+BP) neural network is utilized to replace cost and emission functions. Second, transfer learning is applied with a pretraining and fine-tuning method to improve DBN+BP regression surrogate models, thus realizing fast construction of surrogate models between different regional power systems. Third, a multi-objective antlion optimizer with a novel general single-dimension retention bi-objective optimization policy is proposed to execute MACEED optimization to obtain scheduling decisions. The proposed method not only ensures the convergence, uniformity, and extensibility of the Pareto front, but also greatly reduces the computational time. Finally, a 4-area 40-unit test system with different constraints is employed to demonstrate the effectiveness of the proposed method.
Latent factor (LF) models are highly effective in extracting useful knowledge from High-Dimensional and Sparse (HiDS) matrices, which are commonly seen in various industrial applications. An LF model usually adopts iterative optimizers, which may consume many iterations to reach a local optimum, resulting in considerable time cost. Hence, determining how to accelerate the training process for LF models has become a significant issue. To address this, this work proposes a randomized latent factor (RLF) model. It incorporates the principle of randomized learning techniques from neural networks into the LF analysis of HiDS matrices, thereby greatly alleviating the computational burden. It also extends a standard learning process for randomized neural networks in the context of LF analysis to make the resulting model represent an HiDS matrix correctly. Experimental results on three HiDS matrices from industrial applications demonstrate that, compared with state-of-the-art LF models, RLF is able to achieve significantly higher computational efficiency and comparable prediction accuracy for missing data. It provides an important alternative approach to LF analysis of HiDS matrices, which is especially desired for industrial applications demanding highly efficient models.
To address the issue that traditional clustering methods are not appropriate for high-dimensional data, a cuckoo search fuzzy-weighting algorithm for subspace clustering is presented on the basis of the existing soft subspace clustering algorithm. In the proposed algorithm, a novel objective function is first designed by considering the fuzzy-weighting within-cluster compactness and the between-cluster separation, and by loosening the constraints on the dimension weight matrix. Then gradual membership and improved cuckoo search, a global search strategy, are introduced to optimize the objective function and search for subspace clusters, giving novel learning rules for clustering. Finally, the performance of the proposed algorithm on the clustering analysis of various low- and high-dimensional datasets is experimentally compared with that of several competitive subspace clustering algorithms. Experimental studies demonstrate that the proposed algorithm can obtain better performance than most of the existing soft subspace clustering algorithms.
This paper studies the re-adjusted cross-validation method and a semiparametric regression model called the varying index coefficient model (VICM). We use the profile spline modal estimator method to estimate the coefficients of the parametric part of the VICM, while the unknown function part is expanded using B-splines. Moreover, we combine the above two estimation methods under the assumption of high-dimensional data. The results of data simulation and empirical analysis show that, for the varying index coefficient model, the re-adjusted cross-validation method is better in terms of accuracy and stability than traditional methods based on ordinary least squares.
This paper aims to address the problems of data imbalance, parameter adjustment complexity, and low accuracy in high-dimensional data anomaly detection. To address these issues, an autoencoder and data augmentation-based anomaly detection model for high-dimensional sparse data (SEAOD) is proposed. First, the model solves the problem of imbalanced data by using the weighted SMOTE algorithm and the ENN algorithm to fill in the minority class samples and generate a new dataset. Then, an attention mechanism is employed to calculate the feature similarity and determine the structure of the neural network so that the model can learn the data features. Finally, the data are dimensionally reduced based on the autoencoder, and the sparse high-dimensional data are mapped to a low-dimensional space for anomaly detection, overcoming the impact of the curse of dimensionality on detection algorithms. The experimental results show that on 15 public datasets, this model outperforms the other algorithms compared. Furthermore, it was validated on industrial air quality datasets and achieved the expected results, demonstrating its practicality.
Background: Data from RNA-seq experiments provide a wealth of information about the transcriptome of an organism. However, the analysis of such data is very demanding. In this study, we aimed to establish robust analysis procedures that can be used in clinical practice. Methods: We studied RNA-seq data from triple-negative breast cancer patients. Specifically, we investigated the subsampling of RNA-seq data. Results: The main results of our investigations are as follows: (1) the subsampling of RNA-seq data gave biologically realistic simulations of sequencing experiments with smaller sequencing depth but not direct scaling of count matrices; (2) the saturation of results required an average sequencing depth larger than 32 million reads and an individual sequencing depth larger than 46 million reads; and (3) for an abrogated feature selection, higher moments of the distribution of all expressed genes had a higher sensitivity for signal detection than the corresponding mean values. Conclusions: Our results reveal important characteristics of RNA-seq data that must be understood before one can apply such an approach to translational medicine.
Building on the advantages of the MapReduce programming model in parallel computing and in processing data and tasks on large-scale clusters, a data-aware partitioning schema in MapReduce for large-scale high-dimensional data is proposed. It optimizes the partitioning method for data blocks with the same contribution to computation in MapReduce. Using a two-stage data partitioning strategy, the data are uniformly distributed into data blocks by clustering and partitioning. The experiments show that the data-aware partitioning schema is very effective and extensible for improving the query efficiency of high-dimensional data.
The count of one column for high-dimensional datasets, i.e., the number of records containing this column, has been widely used in numerous applications such as analyzing popular spots based on check-in location information and mining valuable items from shopping records. However, directly publishing this information poses a privacy threat. Differential privacy (DP), as a notable paradigm for strong privacy guarantees, is therefore adopted to publish all column counts. Prior studies have verified that truncating records or grouping columns can effectively improve the accuracy of published results. To leverage the advantages of the two techniques, we combine these studies to further boost the accuracy of published results. However, the traditional penalty function, which measures the error introduced by a given pair of parameters (truncating length and group size), is so sensitive that the derived parameters deviate significantly from the optimal parameters. To output preferable parameters, we first design a smart penalty function that is less sensitive than the traditional function. Moreover, a two-phase selection method is proposed to compute these parameters efficiently, together with the improvement in accuracy. Extensive experiments on a broad spectrum of real-world datasets validate the effectiveness of our proposals.
The estimates of the high-dimensional volatility matrix based on high-frequency data play a pivotal role in many financial applications. However, most existing studies have been built on the sub-Gaussian and cross-sectional independence assumptions of microstructure noise, which are typically violated in the financial markets. In this paper, the authors proposed a new robust volatility matrix estimator, with very mild assumptions on the cross-sectional dependence and tail behaviors of the noises, and demonstrated that it can achieve the optimal convergence rate n^(-1/4). Furthermore, the proposed model offered better explanatory and predictive powers by decomposing the estimator into low-rank and sparse components, using an appropriate regularization procedure. Simulation studies demonstrated that the proposed estimator outperforms its competitors under various dependence structures of microstructure noise. Additionally, an extensive analysis of the high-frequency data for stocks in the Shenzhen Stock Exchange of China demonstrated the practical effectiveness of the estimator.
In this paper, a dimension reduction method based on a bionic optimization algorithm, named Ant Colony Optimization-Selection (ACO-S), is proposed for high-dimensional datasets. Because microarray datasets comprise tens of thousands of features (genes), they are commonly used to test dimension reduction techniques. ACO-S consists of two stages in which two well-known ACO algorithms, namely ant system and ant colony system, are respectively utilized to search for genes. In the first stage, a modified ant system is used to filter the nonsignificant genes from the high-dimensional space, and a number of promising genes are retained for the next step. In the second stage, an improved ant colony system is applied to gene selection. In order to enhance the search ability of ACOs, we propose a method for calculating a priori available heuristic information and design a fuzzy logic controller to dynamically adjust the number of ants in the ant colony system. Furthermore, we devise another fuzzy logic controller to tune the parameter (q0) in the ant colony system. We evaluate the performance of ACO-S on five microarray datasets, which have dimensions varying from 7129 to 12000. We also compare the performance of ACO-S with the results obtained from four existing well-known bionic optimization algorithms. The comparison results show that ACO-S has a notable ability to generate a gene subset with the smallest size and salient features while yielding high classification accuracy. The comparative results generated by ACO-S adopting different classifiers are also given. The proposed method is shown to be a promising and effective tool for mining high-dimensional data and mobile robot navigation.
Model averaging has received much attention in recent years. This paper considers semiparametric model averaging for high-dimensional longitudinal data. To minimize the prediction error, the authors estimate the model weights using a leave-subject-out cross-validation procedure. Asymptotic optimality of the proposed method is proved in the sense that leave-subject-out cross-validation achieves the lowest possible prediction loss asymptotically. Simulation studies show that the performance of the proposed model averaging method is much better than that of some commonly used model selection and averaging methods.
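For longitudinal data, leave-subject-out cross-validation holds out all observations of one subject at a time, rather than single observations. The following Python sketch shows only this splitting step, assuming each observation carries a subject identifier; the function name `leave_subject_out_splits` and the example identifiers are hypothetical, and the estimation of model weights from the resulting folds is not shown.

```python
from collections import defaultdict


def leave_subject_out_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per subject."""
    # Group observation indices by the subject they belong to.
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)
    for sid, test_idx in by_subject.items():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(subject_ids)) if i not in held_out]
        yield sid, train_idx, test_idx


# Six repeated measurements from three subjects (illustrative labels).
subject_ids = ["a", "a", "b", "c", "c", "c"]
splits = list(leave_subject_out_splits(subject_ids))
```

Holding out whole subjects keeps correlated repeated measurements from appearing in both the training and the test fold, which would otherwise make the estimated prediction error too optimistic.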
For several decades, much attention has been paid to the two-sample Behrens-Fisher (BF) problem, which tests the equality of the means or mean vectors of two normal populations with unequal variance/covariance structures. Little work, however, has been done on the k-sample BF problem for high-dimensional data, which tests the equality of the mean vectors of several high-dimensional normal populations with unequal covariance structures. In this paper we study this challenging problem by extending the famous Scheffe's transformation method, which reduces the k-sample BF problem to a one-sample problem. The induced one-sample problem can be easily tested by the classical Hotelling's T^2 test when the size of the resulting sample is very large relative to its dimensionality. For high-dimensional data, however, the dimensionality of the resulting sample is often very large, and even much larger than its sample size, which makes the classical Hotelling's T^2 test not powerful or not even well defined. To overcome this difficulty, we propose and study an L2-norm based test. The asymptotic powers of the proposed L2-norm based test and Hotelling's T^2 test are derived and theoretically compared. Methods for implementing the L2-norm based test are described. Simulation studies are conducted to compare the L2-norm based test and Hotelling's T^2 test when the latter can be well defined, and to compare the proposed implementation methods for the L2-norm based test otherwise. The methodologies are motivated and illustrated by a real data example.
Visualizing intrinsic structures of high-dimensional data is an essential task in data analysis. Over the past decades, a large number of methods have been proposed. Among all solutions, one promising way of enabling effective visual exploration is to construct a k-nearest neighbor (KNN) graph and visualize the graph in a low-dimensional space. Yet, state-of-the-art methods such as LargeVis still suffer from two main problems when applied to large-scale data: (1) they may produce unappealing visualizations due to the non-convexity of the cost function; (2) visualizing the KNN graph is still time-consuming. In this work, we propose a novel visualization algorithm that leverages a multilevel representation to achieve a high-quality graph layout and employs a cluster-based approximation scheme to accelerate the KNN graph layout. Experiments on various large-scale datasets indicate that our approach achieves a speedup by a factor of five for KNN graph visualization compared to LargeVis and yields aesthetically pleasing visualization results.
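To make the KNN graph mentioned in the abstract above concrete, the following Python sketch builds one by brute force with Euclidean distance. It is only an illustration of the data structure: the quadratic-time search, the function name `knn_graph`, and the sample points are assumptions, and neither the multilevel layout nor the cluster-based approximation of the paper is shown.

```python
import math


def knn_graph(points, k):
    """Return {i: [indices of the k nearest other points]} by brute force."""
    graph = {}
    for i, p in enumerate(points):
        # Distance from point i to every other point, paired with its index.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        graph[i] = [j for _, j in dists[:k]]
    return graph


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
graph = knn_graph(points, k=2)
```

This exhaustive search compares every pair of points, so it is practical only for small inputs; large-scale tools typically rely on approximate neighbor search instead.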
Extracting knowledge from high-dimensional data has been notoriously difficult, primarily due to the so-called "curse of dimensionality" and the complex joint distributions of these dimensions. This is a particularly profound issue for high-dimensional gravitational wave data analysis, where one needs to conduct Bayesian inference and estimate joint posterior distributions. In this study, we incorporate prior physical knowledge by sampling from desired interim distributions to develop the training dataset. Accordingly, the more relevant regions of the high-dimensional feature space are covered by additional data points, such that the model can learn the subtle but important details. We adapt the normalizing flow method to be more expressive and trainable, such that the information can be effectively extracted and represented by the transformation between the prior and target distributions. Once trained, our model only takes approximately 1 s on one V100 GPU to generate thousands of samples for probabilistic inference purposes. The evaluation of our approach confirms the efficacy and efficiency of gravitational wave data inferences and points to a promising direction for similar research. The source code, specifications, and detailed procedures are publicly accessible on GitHub.
Funding (LA-NSGA feature selection study): supported in part by the National Natural Science Foundation of China (62172065, 62072060).
Funding (net lattice subspace similarity study): supported by the National Natural Science Foundation of China (No. 61502475) and the Importation and Development of High-Caliber Talents Project of the Beijing Municipal Institutions (No. CIT&TCD201504039).
Funding (OPCE study): National Natural Science Foundation of China, Grant/Award Number: 61972261; Basic Research Foundations of Shenzhen, Grant/Award Numbers: JCYJ20210324093609026, JCYJ20200813091134001.
Funding: supported by grants from CAS, the National Key R&D Program of China, and the National Natural Science Foundation of China
Abstract: Making accurate forecasts or predictions is a challenging task in the big data era, in particular for datasets involving high-dimensional variables but short-term time series points, which are generally available from real-world systems. To address this issue, Prof.
Abstract: Multi-area combined economic/emission dispatch (MACEED) problems are generally studied using analytical functions. However, as the scale of power systems increases, existing solutions become time-consuming and may not meet operational constraints. To overcome excessive computational expense in high-dimensional MACEED problems, a novel data-driven surrogate-assisted method is proposed. First, a cosine-similarity-based deep belief network combined with a back-propagation (DBN+BP) neural network is utilized to replace cost and emission functions. Second, transfer learning is applied with a pretraining and fine-tuning method to improve DBN+BP regression surrogate models, thus realizing fast construction of surrogate models between different regional power systems. Third, a multi-objective antlion optimizer with a novel general single-dimension retention bi-objective optimization policy is proposed to execute MACEED optimization to obtain scheduling decisions. The proposed method not only ensures the convergence, uniformity, and extensibility of the Pareto front, but also greatly reduces the computational time. Finally, a 4-area 40-unit test system with different constraints is employed to demonstrate the effectiveness of the proposed method.
Funding: supported in part by the National Natural Science Foundation of China (61772493, 91646114), in part by the Chongqing research program of technology innovation and application (cstc2017rgzn-zdyfX0020), and in part by the Pioneer Hundred Talents Program of Chinese Academy of Sciences
Abstract: Latent factor (LF) models are highly effective in extracting useful knowledge from High-Dimensional and Sparse (HiDS) matrices, which are commonly seen in various industrial applications. An LF model usually adopts iterative optimizers, which may consume many iterations to reach a local optimum, resulting in considerable time cost. Hence, determining how to accelerate the training process for LF models has become a significant issue. To address this, this work proposes a randomized latent factor (RLF) model. It incorporates the principle of randomized learning techniques from neural networks into the LF analysis of HiDS matrices, thereby greatly alleviating the computational burden. It also extends a standard learning process for randomized neural networks in the context of LF analysis to make the resulting model represent an HiDS matrix correctly. Experimental results on three HiDS matrices from industrial applications demonstrate that, compared with state-of-the-art LF models, RLF is able to achieve significantly higher computational efficiency and comparable prediction accuracy for missing data. It provides an important alternative approach to LF analysis of HiDS matrices, which is especially desired for industrial applications demanding highly efficient models.
Funding: supported in part by the National Natural Science Foundation of China (Nos. 61303074, 61309013), the National Key Basic Research and Development Program ("973") of China (No. 2012CB315900), and the Programs for Science and Technology Development of Henan province (Nos. 12210231003, 13210231002)
Abstract: To address the issue that traditional clustering methods are not appropriate for high-dimensional data, a cuckoo search fuzzy-weighting algorithm for subspace clustering is presented on the basis of the existing soft subspace clustering algorithm. In the proposed algorithm, a novel objective function is first designed by considering the fuzzy weighting within-cluster compactness and the between-cluster separation, and loosening the constraints of the dimension weight matrix. Then gradual membership and improved cuckoo search, a global search strategy, are introduced to optimize the objective function and search subspace clusters, giving novel learning rules for clustering. Finally, the performance of the proposed algorithm on the clustering analysis of various low- and high-dimensional datasets is experimentally compared with that of several competitive subspace clustering algorithms. Experimental studies demonstrate that the proposed algorithm can obtain better performance than most of the existing soft subspace clustering algorithms.
Abstract: This paper studies the re-adjusted cross-validation method and a semiparametric regression model called the varying index coefficient model (VICM). We use the profile spline modal estimator method to estimate the coefficients of the parametric part of the VICM, while the unknown function part is expanded using B-splines. Moreover, we combine the above two estimation methods under the assumption of high-dimensional data. The results of data simulation and empirical analysis show that, for the varying index coefficient model, the re-adjusted cross-validation method is better in terms of accuracy and stability than traditional methods based on ordinary least squares.
Funding: This work is supported by the National Key R&D Program of China under Grant No. 2020YFB1710200.
Abstract: This paper aims to address the problems of data imbalance, parameter adjustment complexity, and low accuracy in high-dimensional data anomaly detection. To address these issues, an autoencoder and data augmentation-based anomaly detection model for high-dimensional sparse data (SEAOD) is proposed. First, the model solves the problem of imbalanced data by using the weighted SMOTE algorithm and ENN algorithm to fill in the minority class samples and generate a new dataset. Then, an attention mechanism is employed to calculate the feature similarity and determine the structure of the neural network so that the model can learn the data features. Finally, the data are dimensionally reduced based on the autoencoder, and the sparse high-dimensional data are mapped to a low-dimensional space for anomaly detection, overcoming the impact of the curse of dimensionality on detection algorithms. The experimental results show that on 15 public datasets, this model outperforms other comparison algorithms. Furthermore, it was validated on industrial air quality datasets and achieved the expected results, showing its practicality.
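The oversampling step mentioned in the abstract above can be illustrated with a minimal sketch. This is plain SMOTE interpolation only, not the weighted SMOTE plus ENN procedure the abstract refers to; the function name `smote_oversample` and its parameters `n_new`, `k`, and `seed` are hypothetical and chosen for illustration.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority points by plain SMOTE interpolation.

    Illustrative sketch only: no sample weighting and no ENN cleaning step.
    Assumes `minority` holds at least two points of equal dimension.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Choose one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the segment base -> neighbour.
        u = rng.random()
        synthetic.append([b + u * (nb - b) for b, nb in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    minority_class = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote_oversample(minority_class, n_new=3, k=2):
        print(point)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the convex hull of the minority class; a cleaning step such as ENN would then remove samples whose neighbours disagree with their label.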
Funding: supported in part by the Arkansas Biosciences Institute under Grant (No. UL1TR000039) and the IDeA Networks of Biomedical Research Excellence (INBRE) Grant (No. P20RR16460)
Abstract: Background: Data from RNA-seq experiments provide a wealth of information about the transcriptome of an organism. However, the analysis of such data is very demanding. In this study, we aimed to establish robust analysis procedures that can be used in clinical practice. Methods: We studied RNA-seq data from triple-negative breast cancer patients. Specifically, we investigated the subsampling of RNA-seq data. Results: The main results of our investigations are as follows: (1) the subsampling of RNA-seq data gave biologically realistic simulations of sequencing experiments with smaller sequencing depth but not direct scaling of count matrices; (2) the saturation of results required an average sequencing depth larger than 32 million reads and an individual sequencing depth larger than 46 million reads; and (3) for an abrogated feature selection, higher moments of the distribution of all expressed genes had a higher sensitivity for signal detection than the corresponding mean values. Conclusions: Our results reveal important characteristics of RNA-seq data that must be understood before one can apply such an approach to translational medicine.
Abstract: Leveraging the advantages of the MapReduce programming model in parallel computing and processing of data and tasks on large-scale clusters, a data-aware partitioning schema in MapReduce for large-scale high-dimensional data is proposed. It optimizes the partition method of data blocks with the same contribution to computation in MapReduce. Using a two-stage data partitioning strategy, the data are uniformly distributed into data blocks by clustering and partitioning. The experiments show that the data-aware partitioning schema is very effective and extensible for improving the query efficiency of high-dimensional data.
Funding: the National Natural Science Foundation of China (Grant Nos. 61433008, 61472071 and U143520006), the Fundamental Research Funds for the Central Universities of China (161604005 and 171605001), and the Natural Science Foundation of Liaoning Province (2015020018).
Abstract: The count of one column for high-dimensional datasets, i.e., the number of records containing this column, has been widely used in numerous applications such as analyzing popular spots based on check-in location information and mining valuable items from shopping records. However, directly publishing this information poses a privacy threat. Differential privacy (DP), as a notable paradigm for strong privacy guarantees, is therefore adopted to publish all column counts. Prior studies have verified that truncating records or grouping columns can effectively improve the accuracy of published results. To leverage the advantages of the two techniques, we combine these studies to further boost the accuracy of published results. However, the traditional penalty function, which measures the error introduced by a given pair of parameters including truncating length and group size, is so sensitive that the derived parameters deviate from the optimal parameters significantly. To output preferable parameters, we first design a smart penalty function that is less sensitive than the traditional function. Moreover, a two-phase selection method is proposed to compute these parameters efficiently, together with the improvement in accuracy. Extensive experiments on a broad spectrum of real-world datasets validate the effectiveness of our proposals.
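To make the role of record truncation in the abstract above concrete, the following is a minimal sketch of publishing column counts with the standard Laplace mechanism. It is not the authors' method: it omits column grouping, the penalty function, and the two-phase parameter selection. The names `truncate_record` and `noisy_column_counts`, and the rule of keeping the first columns in sorted order, are assumptions made for illustration.

```python
import random


def truncate_record(record, max_len):
    """Keep at most max_len columns of a record.

    Assumption for illustration: keep the first columns in sorted order.
    """
    return sorted(record)[:max_len]


def noisy_column_counts(records, columns, max_len, epsilon, seed=0):
    """Publish per-column counts under epsilon-DP via the Laplace mechanism.

    Each record is a set of column names, all of which must appear in `columns`.
    """
    rng = random.Random(seed)
    counts = {c: 0 for c in columns}
    for record in records:
        for c in truncate_record(record, max_len):
            counts[c] += 1
    # After truncation, adding or removing one record changes at most
    # max_len counts by 1 each, so the L1 sensitivity is max_len and the
    # Laplace noise scale is max_len / epsilon.
    scale = max_len / epsilon

    def laplace_noise():
        # The difference of two independent Exp(1) draws is Laplace(0, 1).
        return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

    return {c: counts[c] + laplace_noise() for c in columns}


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a"}, {"b", "c"}]
    print(noisy_column_counts(data, ["a", "b", "c"], max_len=2, epsilon=1.0))
```

The sketch shows the trade-off the abstract alludes to: a smaller truncation length lowers the noise scale but discards more of each record, which is why the choice of truncation length (and, in the paper, group size) matters for accuracy.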
Funding: supported by the National Natural Science Foundation of China under Grant Nos. 72271232, 71873137, the MOE Project of Key Research Institute of Humanities and Social Sciences under Grant No. 22JJD110001, and the support of Public Computing Cloud, Renmin University of China.
Abstract: The estimates of the high-dimensional volatility matrix based on high-frequency data play a pivotal role in many financial applications. However, most existing studies have been built on the sub-Gaussian and cross-sectional independence assumptions of microstructure noise, which are typically violated in the financial markets. In this paper, the authors proposed a new robust volatility matrix estimator, with very mild assumptions on the cross-sectional dependence and tail behaviors of the noises, and demonstrated that it can achieve the optimal convergence rate $n^{-1/4}$. Furthermore, the proposed model offered better explanatory and predictive powers by decomposing the estimator into low-rank and sparse components, using an appropriate regularization procedure. Simulation studies demonstrated that the proposed estimator outperforms its competitors under various dependence structures of microstructure noise. Additionally, an extensive analysis of the high-frequency data for stocks in the Shenzhen Stock Exchange of China demonstrated the practical effectiveness of the estimator.
Abstract: In this paper, a bionic optimization algorithm based dimension reduction method named Ant Colony Optimization-Selection (ACO-S) is proposed for high-dimensional datasets. Because microarray datasets comprise tens of thousands of features (genes), they are usually used to test dimension reduction techniques. ACO-S consists of two stages in which two well-known ACO algorithms, namely ant system and ant colony system, are utilized to search for genes, respectively. In the first stage, a modified ant system is used to filter the nonsignificant genes from high-dimensional space, and a number of promising genes are retained for the next step. In the second stage, an improved ant colony system is applied to gene selection. In order to enhance the search ability of ACOs, we propose a method for calculating priori available heuristic information and design a fuzzy logic controller to dynamically adjust the number of ants in ant colony system. Furthermore, we devise another fuzzy logic controller to tune the parameter (q0) in ant colony system. We evaluate the performance of ACO-S on five microarray datasets, which have dimensions varying from 7129 to 12000. We also compare the performance of ACO-S with the results obtained from four existing well-known bionic optimization algorithms. The comparison results show that ACO-S has a notable ability to generate a gene subset with the smallest size and salient features while yielding high classification accuracy. The comparative results generated by ACO-S adopting different classifiers are also given. The proposed method is shown to be a promising and effective tool for mining high-dimension data and mobile robot navigation.
Funding: the Ministry of Science and Technology of China under Grant No. 2016YFB0502301, the Academy for Multidisciplinary Studies of Capital Normal University, and the National Natural Science Foundation of China under Grant Nos. 11971323 and 11529101.
Abstract: Model averaging has received much attention in recent years. This paper considers semiparametric model averaging for high-dimensional longitudinal data. To minimize the prediction error, the authors estimate the model weights using a leave-subject-out cross-validation procedure. Asymptotic optimality of the proposed method is proved in the sense that leave-subject-out cross-validation achieves the lowest possible prediction loss asymptotically. Simulation studies show that the performance of the proposed model averaging method is much better than that of some commonly used model selection and averaging methods.
Funding: supported by the National University of Singapore Academic Research Grant (Grant No. R-155-000-085-112)
Abstract: For several decades, much attention has been paid to the two-sample Behrens-Fisher (BF) problem, which tests the equality of the means or mean vectors of two normal populations with unequal variance/covariance structures. Little work, however, has been done for the k-sample BF problem for high-dimensional data, which tests the equality of the mean vectors of several high-dimensional normal populations with unequal covariance structures. In this paper we study this challenging problem via extending the famous Scheffe's transformation method, which reduces the k-sample BF problem to a one-sample problem. The induced one-sample problem can be easily tested by the classical Hotelling's $T^2$ test when the size of the resulting sample is very large relative to its dimensionality. For high-dimensional data, however, the dimensionality of the resulting sample is often very large, and even much larger than its sample size, which makes the classical Hotelling's $T^2$ test not powerful or not even well defined. To overcome this difficulty, we propose and study an L2-norm based test. The asymptotic powers of the proposed L2-norm based test and Hotelling's $T^2$ test are derived and theoretically compared. Methods for implementing the L2-norm based test are described. Simulation studies are conducted to compare the L2-norm based test and Hotelling's $T^2$ test when the latter can be well defined, and to compare the proposed implementation methods for the L2-norm based test otherwise. The methodologies are motivated and illustrated by a real data example.
Abstract: Visualizing intrinsic structures of high-dimensional data is an essential task in data analysis. Over the past decades, a large number of methods have been proposed. Among all solutions, one promising way for enabling effective visual exploration is to construct a k-nearest neighbor (KNN) graph and visualize the graph in a low-dimensional space. Yet, state-of-the-art methods such as LargeVis still suffer from two main problems when applied to large-scale data: (1) they may produce unappealing visualizations due to the non-convexity of the cost function; (2) visualizing the KNN graph is still time-consuming. In this work, we propose a novel visualization algorithm that leverages a multilevel representation to achieve a high-quality graph layout and employs a cluster-based approximation scheme to accelerate the KNN graph layout. Experiments on various large-scale datasets indicate that our approach achieves a speedup by a factor of five for KNN graph visualization compared to LargeVis and yields aesthetically pleasing visualization results.
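As a point of reference for the KNN graph mentioned in the abstract above, here is a minimal brute-force sketch of how such a graph can be built. It is an illustration only, not the multilevel or cluster-based scheme the abstract proposes; the function name `knn_graph` is hypothetical, and Euclidean distance is an assumption.

```python
import math


def knn_graph(points, k):
    """Return {i: [indices of the k nearest other points]} by brute force.

    Illustrative sketch: every pair of points is compared, so the cost
    grows quadratically with the number of points.
    """
    graph = {}
    for i, p in enumerate(points):
        # Rank all other points by Euclidean distance to point i.
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        graph[i] = others[:k]
    return graph


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
    print(knn_graph(sample, k=1))
```

The quadratic cost of this exact construction is one reason large-scale methods rely on approximation, and the graph layout step that follows is the part the abstract reports as the remaining bottleneck.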
Funding: supported by the Peng Cheng Laboratory Cloud Brain (No. PCL2021A13), the National Natural Science Foundation of China (Nos. 11721303, 12075297, and 11690021), and the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA1502110202)
Abstract: Extracting knowledge from high-dimensional data has been notoriously difficult, primarily due to the so-called "curse of dimensionality" and the complex joint distributions of these dimensions. This is a particularly profound issue for high-dimensional gravitational wave data analysis, where one is required to conduct Bayesian inference and estimate joint posterior distributions. In this study, we incorporate prior physical knowledge by sampling from desired interim distributions to develop the training dataset. Accordingly, the more relevant regions of the high-dimensional feature space are covered by additional data points, such that the model can learn the subtle but important details. We adapt the normalizing flow method to be more expressive and trainable, such that the information can be effectively extracted and represented by the transformation between the prior and target distributions. Once trained, our model only takes approximately 1 s on one V100 GPU to generate thousands of samples for probabilistic inference purposes. The evaluation of our approach confirms the efficacy and efficiency of gravitational wave data inferences and points to a promising direction for similar research. The source code, specifications, and detailed procedures are publicly accessible on GitHub.