Journal Articles
42 articles found
1. Optimal Estimation of High-Dimensional Covariance Matrices with Missing and Noisy Data
Authors: Meiyin Wang, Wanzhou Ye. Advances in Pure Mathematics, 2024, No. 4, pp. 214-227 (14 pages)
The estimation of covariance matrices is very important in many fields, such as statistics. In real applications, data are frequently influenced by high dimensions and noise. However, most relevant studies are based on complete data. This paper studies the optimal estimation of high-dimensional covariance matrices based on missing and noisy samples under the norm. First, the model with sub-Gaussian additive noise is presented. The generalized sample covariance is then modified to define a hard thresholding estimator, and the minimax upper bound is derived. After that, the minimax lower bound is derived, and it is concluded that the estimator presented in this article is rate-optimal. Finally, numerical simulation analysis is performed. The results show that for missing samples with sub-Gaussian noise, if the true covariance matrix is sparse, the hard thresholding estimator outperforms the traditional estimation method.
Keywords: High-dimensional covariance matrix; Missing data; Sub-Gaussian noise; Optimal estimation
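As a quick illustration of the entrywise hard-thresholding idea in this abstract, here is a minimal Python/NumPy sketch for the complete-data case; the threshold constant, its sqrt(log p / n) level, and the complete-data simplification are illustrative assumptions, not the paper's exact construction.

import numpy as np

def hard_threshold_cov(X, c=1.0):
    """Entrywise hard thresholding of the sample covariance (complete data)."""
    n, p = X.shape
    S = np.cov(X, rowvar=False)          # sample covariance
    tau = c * np.sqrt(np.log(p) / n)     # threshold of order sqrt(log p / n)
    S_hat = np.where(np.abs(S) >= tau, S, 0.0)
    np.fill_diagonal(S_hat, np.diag(S))  # keep the diagonal un-thresholded
    return S_hat

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))       # n = 200 samples, p = 50 dimensions
S_hat = hard_threshold_cov(X, c=0.5)
print("nonzero off-diagonal entries:", np.count_nonzero(S_hat) - 50)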
2. A Length-Adaptive Non-Dominated Sorting Genetic Algorithm for Bi-Objective High-Dimensional Feature Selection
Authors: Yanlu Gong, Junhai Zhou, Quanwang Wu, MengChu Zhou, Junhao Wen. IEEE/CAA Journal of Automatica Sinica (SCIE, EI, CSCD), 2023, No. 9, pp. 1834-1844 (11 pages)
As a crucial data preprocessing method in data mining, feature selection (FS) can be regarded as a bi-objective optimization problem that aims to maximize classification accuracy and minimize the number of selected features. Evolutionary computing (EC) is promising for FS owing to its powerful search capability. However, in traditional EC-based methods, feature subsets are represented via a length-fixed individual encoding. This is ineffective for high-dimensional data, because it results in a huge search space and prohibitive training time. This work proposes a length-adaptive non-dominated sorting genetic algorithm (LA-NSGA) with a length-variable individual encoding and a length-adaptive evolution mechanism for bi-objective high-dimensional FS. In LA-NSGA, an initialization method based on correlation and redundancy is devised to initialize individuals of diverse lengths, and a Pareto dominance-based length change operator is introduced to guide individuals to explore promising search space adaptively. Moreover, a dominance-based local search method is employed for further improvement. Experimental results on 12 high-dimensional gene datasets show that the Pareto front of feature subsets produced by LA-NSGA is superior to those of existing algorithms.
Keywords: Bi-objective optimization; Feature selection (FS); Genetic algorithm; High-dimensional data; Length-adaptive
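The bi-objective view described above rests on a Pareto dominance test over (classification error, subset size) pairs; a minimal sketch follows, with illustrative function and tuple conventions that are not taken from LA-NSGA's code.

def dominates(a, b):
    """a, b: (classification_error, n_selected_features) tuples.
    a dominates b if it is no worse in both objectives and strictly
    better in at least one."""
    no_worse = a[0] <= b[0] and a[1] <= b[1]
    strictly_better = a[0] < b[0] or a[1] < b[1]
    return no_worse and strictly_better

print(dominates((0.08, 12), (0.10, 30)))  # True: better in both objectives
print(dominates((0.08, 40), (0.10, 30)))  # False: a trade-off, neither dominates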
3. Similarity measurement method of high-dimensional data based on normalized net lattice subspace (Cited by 4)
Authors: Li Wenfa, Wang Gongming, Li Ke, Huang Su. High Technology Letters (EI, CAS), 2017, No. 2, pp. 179-184 (6 pages)
The performance of conventional similarity measurement methods is seriously affected by the curse of dimensionality of high-dimensional data. The reason is that the data difference between sparse and noisy dimensionalities occupies a large proportion of the similarity, so that the dissimilarities between any two results become almost indistinguishable. A similarity measurement method of high-dimensional data based on normalized net lattice subspace is proposed. The data range of each dimension is divided into several intervals, and the components in different dimensions are mapped onto the corresponding intervals. Only components in the same or adjacent intervals are used to calculate the similarity. To validate this method, three data types are used, and seven common similarity measurement methods are compared. The experimental results indicate that the relative difference of the method increases with the dimensionality and is approximately two or three orders of magnitude higher than that of the conventional methods. In addition, the similarity range of this method in different dimensions is [0, 1], which makes it suitable for similarity analysis after dimensionality reduction.
Keywords: High-dimensional data; Curse of dimensionality; Similarity; Normalization; Subspace; NPsim
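A rough sketch of the net-lattice idea as summarized above: split each dimension's range into intervals and let only lattice-close components contribute to the similarity. The per-dimension score and the averaging are our own simplifications; the paper's NPsim measure may weight things differently.

import numpy as np

def lattice_similarity(x, y, lo, hi, n_bins=10):
    """x, y: 1-D vectors; lo, hi: per-dimension data range arrays.
    Returns a similarity in [0, 1]."""
    width = (hi - lo) / n_bins
    bx = np.clip(((x - lo) / width).astype(int), 0, n_bins - 1)
    by = np.clip(((y - lo) / width).astype(int), 0, n_bins - 1)
    close = np.abs(bx - by) <= 1                 # same or adjacent interval
    # per-dimension similarity, counted only where the components are
    # lattice-close; distant or noisy dimensions contribute nothing
    per_dim = np.where(close, 1.0 - np.abs(x - y) / (hi - lo), 0.0)
    return per_dim.mean()

rng = np.random.default_rng(1)
data = rng.random((100, 500))                    # 500-dimensional data
lo, hi = data.min(axis=0), data.max(axis=0)
print(lattice_similarity(data[0], data[1], lo, hi))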
4. Observation points classifier ensemble for high-dimensional imbalanced classification
Authors: Yulin He, Xu Li, Philippe Fournier-Viger, Joshua Zhexue Huang, Mianjie Li, Salman Salloum. CAAI Transactions on Intelligence Technology (SCIE, EI), 2023, No. 2, pp. 500-517 (18 pages)
In this paper, an Observation Points Classifier Ensemble (OPCE) algorithm is proposed to deal with High-Dimensional Imbalanced Classification (HDIC) problems, based on data processed using the Multi-Dimensional Scaling (MDS) feature extraction technique. First, the dimensionality of the original imbalanced data is reduced using MDS so that distances between any two different samples are preserved as well as possible. Second, a novel OPCE algorithm is applied to classify imbalanced samples by placing optimised observation points in a low-dimensional data space. Third, optimization of the observation point mappings is carried out to obtain a reliable assessment of the unknown samples. Exhaustive experiments have been conducted to evaluate the feasibility, rationality, and effectiveness of the proposed OPCE algorithm using seven benchmark HDIC data sets. Experimental results show that (1) the OPCE algorithm can be trained faster on low-dimensional imbalanced data than on high-dimensional data; (2) the OPCE algorithm can correctly identify samples as the number of optimised observation points is increased; and (3) statistical analysis reveals that OPCE yields better HDIC performance on the selected data sets in comparison with eight other HDIC algorithms. This demonstrates that OPCE is a viable algorithm to deal with HDIC problems.
Keywords: Classifier ensemble; Feature transformation; High-dimensional data classification; Imbalanced learning; Observation point mechanism
5. Dimensionality Reduction of High-Dimensional Highly Correlated Multivariate Grapevine Dataset
Authors: Uday Kant Jha, Peter Bajorski, Ernest Fokoue, Justine Vanden Heuvel, Jan van Aardt, Grant Anderson. Open Journal of Statistics, 2017, No. 4, pp. 702-717 (16 pages)
Viticulturists traditionally have a keen interest in studying the relationship between the biochemistry of grapevines' leaves/petioles and their associated spectral reflectance in order to understand fruit ripening rate, water status, nutrient levels, and disease risk. In this paper, we implement imaging spectroscopy (hyperspectral) reflectance data, for the reflective 330-2510 nm wavelength region (986 total spectral bands), to assess vineyard nutrient status; this constitutes a high-dimensional dataset with a covariance matrix that is ill-conditioned. The identification of the variables (wavelength bands) that contribute useful information for nutrient assessment and prediction plays a pivotal role in multivariate statistical modeling. In recent years, researchers have successfully developed many continuous, nearly unbiased, sparse, and accurate variable selection methods to overcome this problem. This paper compares four regularized regression methods and one functional regression method: Elastic Net, Multi-Step Adaptive Elastic Net, Minimax Concave Penalty, iterative Sure Independence Screening, and Functional Data Analysis for wavelength variable selection. Thereafter, the predictive performance of these regularized sparse models is enhanced using stepwise regression. This comparative study of regression methods using a high-dimensional and highly correlated grapevine hyperspectral dataset revealed that Elastic Net for variable selection yields the best predictive ability.
Keywords: High-dimensional data; Multi-step adaptive elastic net; Minimax concave penalty; Sure independence screening; Functional data analysis
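A minimal sketch of elastic-net wavelength selection, the approach this comparison found most predictive; the synthetic spectra and the chosen l1_ratio grid below merely stand in for the (non-public) 986-band grapevine data and the paper's tuning.

import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(2)
n_samples, n_bands = 120, 986
X = rng.standard_normal((n_samples, n_bands))        # spectra (samples x bands)
beta = np.zeros(n_bands)
beta[[50, 300, 700]] = [2.0, -1.5, 1.0]              # three informative bands
y = X @ beta + 0.1 * rng.standard_normal(n_samples)  # nutrient response

model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)               # retained wavelength bands
print(f"{selected.size} bands selected:", selected[:10])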
6. Making Short-term High-dimensional Data Predictable
Author: CHEN Luonan. Bulletin of the Chinese Academy of Sciences, 2018, No. 4, pp. 243-244 (2 pages)
Making accurate forecasts or predictions is a challenging task in the big data era, in particular for those datasets involving high-dimensional variables but short-term time series points, which are generally available from real-world systems. To address this issue, Prof. …
Keywords: RDE; Making short-term high-dimensional data predictable
7. Data-driven Surrogate-assisted Method for High-dimensional Multi-area Combined Economic/Emission Dispatch
Authors: Chenhao Lin, Huijun Liang, Aokang Pang, Jianwei Zhong, Yongchao Yang. Journal of Modern Power Systems and Clean Energy (SCIE, EI, CSCD), 2024, No. 1, pp. 52-64 (13 pages)
Multi-area combined economic/emission dispatch (MACEED) problems are generally studied using analytical functions. However, as the scale of power systems increases, existing solutions become time-consuming and may not meet operational constraints. To overcome excessive computational expense in high-dimensional MACEED problems, a novel data-driven surrogate-assisted method is proposed. First, a cosine-similarity-based deep belief network combined with a back-propagation (DBN+BP) neural network is utilized to replace cost and emission functions. Second, transfer learning is applied with a pretraining and fine-tuning method to improve the DBN+BP regression surrogate models, thus realizing fast construction of surrogate models between different regional power systems. Third, a multi-objective antlion optimizer with a novel general single-dimension retention bi-objective optimization policy is proposed to execute MACEED optimization and obtain scheduling decisions. The proposed method not only ensures the convergence, uniformity, and extensibility of the Pareto front, but also greatly reduces the computational time. Finally, a 4-area 40-unit test system with different constraints is employed to demonstrate the effectiveness of the proposed method.
Keywords: Multi-area combined economic/emission dispatch; High-dimensional power system; Deep belief network; Data driven; Transfer learning
8. Randomized Latent Factor Model for High-dimensional and Sparse Matrices from Industrial Applications (Cited by 13)
Authors: Mingsheng Shang, Xin Luo, Zhigang Liu, Jia Chen, Ye Yuan, MengChu Zhou. IEEE/CAA Journal of Automatica Sinica (EI, CSCD), 2019, No. 1, pp. 131-141 (11 pages)
Latent factor (LF) models are highly effective in extracting useful knowledge from High-Dimensional and Sparse (HiDS) matrices, which are commonly seen in various industrial applications. An LF model usually adopts iterative optimizers, which may consume many iterations to achieve a local optimum, resulting in considerable time cost. Hence, determining how to accelerate the training process for LF models has become a significant issue. To address this, this work proposes a randomized latent factor (RLF) model. It incorporates the principle of randomized learning techniques from neural networks into the LF analysis of HiDS matrices, thereby greatly alleviating the computational burden. It also extends a standard learning process for randomized neural networks to the context of LF analysis, so that the resulting model represents an HiDS matrix correctly. Experimental results on three HiDS matrices from industrial applications demonstrate that, compared with state-of-the-art LF models, RLF is able to achieve significantly higher computational efficiency and comparable prediction accuracy for missing data. It provides an important alternative approach to LF analysis of HiDS matrices, which is especially desirable for industrial applications demanding highly efficient models.
Keywords: Big data; High-dimensional and sparse matrix; Latent factor analysis; Latent factor model; Randomized learning
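A rough sketch of the randomized-learning principle the abstract describes: draw one latent factor matrix at random, keep it fixed, and obtain the other factor in closed form row by row, with no iterative optimizer. The dimensions, ridge constant, and per-row solver are illustrative assumptions, not RLF's exact formulation.

import numpy as np

def randomized_lf(R, mask, k=10, lam=0.1, seed=0):
    """R: (m, n) rating-like matrix; mask: boolean observed-entry matrix."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    Q = rng.standard_normal((n, k)) / np.sqrt(k)   # randomly fixed item factors
    P = np.zeros((m, k))                           # user factors, solved per row
    for u in range(m):
        idx = mask[u]                              # observed items for this row
        A = Q[idx]                                 # (n_u, k) design matrix
        # closed-form ridge solution on this row's observed entries
        P[u] = np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ R[u, idx])
    return P @ Q.T                                 # completed matrix estimate

rng = np.random.default_rng(3)
true = rng.random((50, 5)) @ rng.random((5, 40))   # low-rank ground truth
mask = rng.random((50, 40)) < 0.3                  # ~30% of entries observed
R = np.where(mask, true, 0.0)
pred = randomized_lf(R, mask)
print("RMSE on held-out:", np.sqrt(np.mean((pred[~mask] - true[~mask]) ** 2)))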
9. CSFW-SC: Cuckoo Search Fuzzy-Weighting Algorithm for Subspace Clustering Applying to High-Dimensional Clustering (Cited by 1)
Authors: WANG Jindong, HE Jiajing, ZHANG Hengwei, YU Zhiyong. China Communications (SCIE, CSCD), 2015, No. S2, pp. 55-63 (9 pages)
Aimed at the issue that traditional clustering methods are not appropriate for high-dimensional data, a cuckoo search fuzzy-weighting algorithm for subspace clustering is presented on the basis of an existing soft subspace clustering algorithm. In the proposed algorithm, a novel objective function is first designed by considering the fuzzy weighting within-cluster compactness and the between-cluster separation, and by loosening the constraints of the dimension weight matrix. Then gradual membership and improved cuckoo search, a global search strategy, are introduced to optimize the objective function and search for subspace clusters, giving novel learning rules for clustering. At last, the performance of the proposed algorithm on the clustering analysis of various low- and high-dimensional datasets is experimentally compared with that of several competitive subspace clustering algorithms. Experimental studies demonstrate that the proposed algorithm can obtain better performance than most of the existing soft subspace clustering algorithms.
Keywords: High-dimensional data; Clustering; Soft subspace; Cuckoo search; Fuzzy clustering
10. Variance Estimation for High-Dimensional Varying Index Coefficient Models
Authors: Miao Wang, Hao Lv, Yicun Wang. Open Journal of Statistics, 2019, No. 5, pp. 555-570 (16 pages)
This paper studies the re-adjusted cross-validation method and a semiparametric regression model called the varying index coefficient model. We use the profile spline modal estimator method to estimate the coefficients of the parametric part of the varying index coefficient model (VICM), while the unknown function part is expanded using B-splines. Moreover, we combine the above two estimation methods under the assumption of high-dimensional data. The results of data simulation and empirical analysis show that, for the varying index coefficient model, the re-adjusted cross-validation method is better in terms of accuracy and stability than traditional methods based on ordinary least squares.
Keywords: High-dimensional data; Refitted cross-validation; Varying index coefficient models; Variance estimation
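The re-adjusted (refitted) cross-validation idea is easiest to see in a plain high-dimensional linear model: select variables on one half of the data, then refit by least squares on the other half and read off the residual variance. A minimal sketch under that simplification; the paper's B-spline VICM machinery is omitted, and the Lasso selector here is an illustrative stand-in.

import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(4)
n, p, sigma = 200, 500, 1.5
X = rng.standard_normal((n, p))
y = 2 * X[:, 0] - 3 * X[:, 1] + sigma * rng.standard_normal(n)

half = n // 2
# stage 1: variable selection on the first half
sel = np.flatnonzero(LassoCV(cv=5).fit(X[:half], y[:half]).coef_)
# stage 2: OLS refit on the second half, restricted to the selected set
ols = LinearRegression().fit(X[half:, sel], y[half:])
resid = y[half:] - ols.predict(X[half:, sel])
sigma2_hat = resid @ resid / (n - half - sel.size)   # residual variance
print(f"true sigma^2 = {sigma**2:.2f}, refitted-CV estimate = {sigma2_hat:.2f}")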
11. Outlier Detection Model Based on Autoencoder and Data Augmentation for High-Dimensional Sparse Data
Authors: Haitao Zhang, Wenhai Ma, Qilong Han, Zhiqiang Ma. 《国际计算机前沿大会会议论文集》 (EI), 2023, No. 1, pp. 192-206 (15 pages)
This paper aims to address the problems of data imbalance, parameter adjustment complexity, and low accuracy in high-dimensional data anomaly detection. To address these issues, an autoencoder and data augmentation-based anomaly detection model for high-dimensional sparse data (SEAOD) is proposed. First, the model solves the problem of imbalanced data by using the weighted SMOTE algorithm and the ENN algorithm to fill in minority-class samples and generate a new dataset. Then, an attention mechanism is employed to calculate feature similarity and determine the structure of the neural network so that the model can learn the data features. Finally, the data are dimensionally reduced based on the autoencoder, and the sparse high-dimensional data are mapped to a low-dimensional space for anomaly detection, overcoming the impact of the curse of dimensionality on detection algorithms. The experimental results show that this model outperforms other comparison algorithms on 15 public datasets. Furthermore, it was validated on industrial air quality datasets and achieved the expected results with practicality.
Keywords: High-dimensional data; Data augmentation; Attention mechanism; Outlier detection
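A rough sketch of the two stages the abstract describes, with off-the-shelf stand-ins: imblearn's combined SMOTE+ENN resampler for rebalancing (the paper's weighted SMOTE variant is assumed away) and a bottleneck MLP as the autoencoder; the layer sizes and the 95th-percentile cutoff are illustrative, and the attention mechanism is omitted.

import numpy as np
from imblearn.combine import SMOTEENN
from sklearn.neural_network import MLPRegressor
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=50, weights=[0.95],
                           random_state=0)
# stage 1: oversample the minority class with SMOTE, then clean with ENN
X_bal, y_bal = SMOTEENN(random_state=0).fit_resample(X, y)

# stage 2: autoencoder = MLP trained to reproduce its input via a bottleneck
ae = MLPRegressor(hidden_layer_sizes=(32, 8, 32), max_iter=500,
                  random_state=0).fit(X_bal, X_bal)
errors = np.mean((X - ae.predict(X)) ** 2, axis=1)   # reconstruction error
flagged = errors > np.quantile(errors, 0.95)         # top 5% scored as outliers
print(f"{flagged.sum()} points flagged as anomalies")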
12. Effects of subsampling on characteristics of RNA-seq data from triple-negative breast cancer patients
Authors: Alexey Stupnikov, Galina V Glazko, Frank Emmert-Streib. Chinese Journal of Cancer (SCIE, CAS, CSCD), 2015, No. 10, pp. 427-438 (12 pages)
Background: Data from RNA-seq experiments provide a wealth of information about the transcriptome of an organism. However, the analysis of such data is very demanding. In this study, we aimed to establish robust analysis procedures that can be used in clinical practice. Methods: We studied RNA-seq data from triple-negative breast cancer patients. Specifically, we investigated the subsampling of RNA-seq data. Results: The main results of our investigations are as follows: (1) the subsampling of RNA-seq data gave biologically realistic simulations of sequencing experiments with smaller sequencing depth, whereas direct scaling of count matrices did not; (2) the saturation of results required an average sequencing depth larger than 32 million reads and an individual sequencing depth larger than 46 million reads; and (3) for an abrogated feature selection, higher moments of the distribution of all expressed genes had a higher sensitivity for signal detection than the corresponding mean values. Conclusions: Our results reveal important characteristics of RNA-seq data that must be understood before one can apply such an approach to translational medicine.
Keywords: RNA-seq data; Computational genomics; Statistical robustness; High-dimensional biology; Triple-negative breast cancer
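Result (1) is easy to reproduce in miniature: binomial subsampling of reads preserves count-level noise at a lower depth, whereas directly scaling the count matrix does not. A minimal sketch with synthetic counts; the negative-binomial generator and the 50% target depth are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(5)
counts = rng.negative_binomial(n=5, p=0.1, size=(20000, 8))  # genes x samples

fraction = 0.5                                     # target: half the depth
subsampled = rng.binomial(counts, fraction)        # each read kept w.p. 0.5
scaled = np.round(counts * fraction).astype(int)   # naive rescaling

# Low-count genes show the difference: scaling deterministically halves them,
# while subsampling keeps realistic dispersion, including dropout to zero.
g = counts.sum(axis=1).argmin()
print("original:  ", counts[g])
print("subsampled:", subsampled[g])
print("scaled:    ", scaled[g])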
13. Data-Aware Partitioning Schema in MapReduce
Authors: Liang Junjie, Liu Qiongni, Yin Li, Yu Dunhui. 《国际计算机前沿大会会议论文集》, 2015, No. 1, pp. 28-29 (2 pages)
With the advantages of the MapReduce programming model in parallel computing and in processing data and tasks on large-scale clusters, a data-aware partitioning schema in MapReduce for large-scale high-dimensional data is proposed. It optimizes the partitioning of data blocks so that each block makes the same contribution to computation in MapReduce. Using a two-stage data partitioning strategy, the data are uniformly distributed into data blocks by clustering and partitioning. The experiments show that the data-aware partitioning schema is very effective and extensible for improving the query efficiency of high-dimensional data.
Keywords: Cloud computing; MapReduce; High-dimensional data; Data-aware partitioning
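A rough sketch of one way to realize the two-stage strategy sketched above: cluster first, then cut each cluster into blocks of nearly equal size so every block carries a comparable load. KMeans and the block size are our stand-ins; the two-page abstract does not fix either choice.

import numpy as np
from sklearn.cluster import KMeans

def two_stage_partition(X, n_clusters=4, block_size=100):
    # stage 1: group similar rows so blocks are locality-preserving
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X)
    # stage 2: split each cluster into blocks of (nearly) equal size
    blocks = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        blocks += [idx[i:i + block_size] for i in range(0, idx.size, block_size)]
    return blocks                                   # row indices per block

X = np.random.default_rng(6).random((1000, 128))    # high-dimensional rows
blocks = two_stage_partition(X)
print(f"{len(blocks)} blocks, sizes:", sorted({b.size for b in blocks}))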
14. Differentially private high-dimensional data publication via grouping and truncating techniques (Cited by 3)
Authors: Ning WANG, Yu GU, Jia XU, Fangfang LI, Ge YU. Frontiers of Computer Science (SCIE, EI, CSCD), 2019, No. 2, pp. 382-395 (14 pages)
The count of one column for high-dimensional datasets, i.e., the number of records containing this column, has been widely used in numerous applications such as analyzing popular spots based on check-in location information and mining valuable items from shopping records. However, this poses a privacy threat when directly publishing this information. Differential privacy (DP), as a notable paradigm for strong privacy guarantees, is thereby adopted to publish all column counts. Prior studies have verified that truncating records or grouping columns can effectively improve the accuracy of published results. To leverage the advantages of the two techniques, we combine these studies to further boost the accuracy of published results. However, the traditional penalty function, which measures the error imported by a given pair of parameters including truncating length and group size, is so sensitive that the derived parameters deviate significantly from the optimal parameters. To output preferable parameters, we first design a smart penalty function that is less sensitive than the traditional function. Moreover, a two-phase selection method is proposed to compute these parameters efficiently, together with an improvement in accuracy. Extensive experiments on a broad spectrum of real-world datasets validate the effectiveness of our proposals.
Keywords: Differential privacy; High-dimensional data; Truncation; Optimization; Grouping; Penalty function
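The truncation half of the technique has a compact differentially private core: cap each record at ell columns so that a single record shifts the count vector by at most ell in L1, then add Laplace(ell/eps) noise to every column count. A minimal sketch; the grouping refinement and the penalty-function parameter search are omitted, and all names are ours.

import numpy as np

def dp_column_counts(records, n_cols, ell, eps, seed=0):
    """records: list of column-index lists; returns noisy per-column counts."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_cols)
    for r in records:
        for c in r[:ell]:                 # truncate each record to ell columns
            counts[c] += 1
    # after truncation the L1 sensitivity is ell, so Laplace scale = ell / eps
    return counts + rng.laplace(scale=ell / eps, size=n_cols)

records = [[0, 2, 3], [1], [0, 1, 2, 4, 5], [2, 3]]
print(dp_column_counts(records, n_cols=6, ell=3, eps=1.0))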
15. High-Dimensional Volatility Matrix Estimation with Cross-Sectional Dependent and Heavy-Tailed Microstructural Noise
Authors: LIANG Wanwan, WU Ben, FAN Xinyan, JING Bingyi, ZHANG Bo. Journal of Systems Science & Complexity (SCIE, EI, CSCD), 2023, No. 5, pp. 2125-2154 (30 pages)
Estimates of the high-dimensional volatility matrix based on high-frequency data play a pivotal role in many financial applications. However, most existing studies have been built on the sub-Gaussian and cross-sectional independence assumptions on microstructure noise, which are typically violated in financial markets. In this paper, the authors propose a new robust volatility matrix estimator, with very mild assumptions on the cross-sectional dependence and tail behaviors of the noise, and demonstrate that it can achieve the optimal convergence rate n^(-1/4). Furthermore, the proposed model offers better explanatory and predictive power by decomposing the estimator into low-rank and sparse components, using an appropriate regularization procedure. Simulation studies demonstrate that the proposed estimator outperforms its competitors under various dependence structures of microstructure noise. Additionally, an extensive analysis of high-frequency data for stocks in the Shenzhen Stock Exchange of China demonstrates the practical effectiveness of the estimator.
Keywords: Cross-sectional dependence; High-dimensional data; High-frequency data; Integrated volatility matrix; Market microstructure noise
16. An Ant Colony Optimization Based Dimension Reduction Method for High-Dimensional Datasets (Cited by 3)
Authors: Ying Li, Gang Wang, Huiling Chen, Lian Shi, Lei Qin. Journal of Bionic Engineering (SCIE, EI, CSCD), 2013, No. 2, pp. 231-241 (11 pages)
In this paper, a bionic optimization algorithm based dimension reduction method named Ant Colony Optimization-Selection (ACO-S) is proposed for high-dimensional datasets. Because microarray datasets comprise tens of thousands of features (genes), they are usually used to test dimension reduction techniques. ACO-S consists of two stages in which two well-known ACO algorithms, namely the ant system and the ant colony system, are utilized to seek genes, respectively. In the first stage, a modified ant system is used to filter nonsignificant genes from the high-dimensional space, and a number of promising genes are reserved for the next step. In the second stage, an improved ant colony system is applied to gene selection. In order to enhance the search ability of ACOs, we propose a method for calculating a priori available heuristic information and design a fuzzy logic controller to dynamically adjust the number of ants in the ant colony system. Furthermore, we devise another fuzzy logic controller to tune the parameter q0 in the ant colony system. We evaluate the performance of ACO-S on five microarray datasets, whose dimensions vary from 7129 to 12000. We also compare the performance of ACO-S with the results obtained from four existing well-known bionic optimization algorithms. The comparison results show that ACO-S has a notable ability to generate a gene subset with the smallest size and salient features while yielding high classification accuracy. The comparative results generated by ACO-S adopting different classifiers are also given. The proposed method is shown to be a promising and effective tool for mining high-dimensional data and mobile robot navigation.
Keywords: Gene selection; Feature selection; Ant colony optimization; High-dimensional data
17. Average Estimation of Semiparametric Models for High-Dimensional Longitudinal Data (Cited by 2)
Authors: ZHAO Zhihao, ZOU Guohua. Journal of Systems Science & Complexity (SCIE, EI, CSCD), 2020, No. 6, pp. 2013-2047 (35 pages)
Model averaging has received much attention in recent years. This paper considers semiparametric model averaging for high-dimensional longitudinal data. To minimize the prediction error, the authors estimate the model weights using a leave-subject-out cross-validation procedure. Asymptotic optimality of the proposed method is proved in the sense that leave-subject-out cross-validation achieves the lowest possible prediction loss asymptotically. Simulation studies show that the performance of the proposed model averaging method is much better than that of some commonly used model selection and averaging methods.
Keywords: Asymptotic optimality; High-dimensional longitudinal data; Leave-subject-out cross-validation; Model averaging; Semiparametric models
18. On the k-sample Behrens-Fisher problem for high-dimensional data (Cited by 3)
Authors: ZHANG JinTing, XU JinFeng. Science China Mathematics (SCIE), 2009, No. 6, pp. 1285-1304 (20 pages)
For several decades, much attention has been paid to the two-sample Behrens-Fisher (BF) problem, which tests the equality of the means or mean vectors of two normal populations with unequal variance/covariance structures. Little work, however, has been done on the k-sample BF problem for high-dimensional data, which tests the equality of the mean vectors of several high-dimensional normal populations with unequal covariance structures. In this paper we study this challenging problem by extending the famous Scheffé transformation method, which reduces the k-sample BF problem to a one-sample problem. The induced one-sample problem can be easily tested by the classical Hotelling's T² test when the size of the resulting sample is very large relative to its dimensionality. For high-dimensional data, however, the dimensionality of the resulting sample is often very large, and even much larger than its sample size, which makes the classical Hotelling's T² test not powerful or not even well defined. To overcome this difficulty, we propose and study an L²-norm based test. The asymptotic powers of the proposed L²-norm based test and Hotelling's T² test are derived and theoretically compared. Methods for implementing the L²-norm based test are described. Simulation studies are conducted to compare the L²-norm based test and Hotelling's T² test when the latter can be well defined, and to compare the proposed implementation methods for the L²-norm based test otherwise. The methodologies are motivated and illustrated by a real data example.
Keywords: χ²-approximation; χ²-type mixtures; High-dimensional data analysis; Hotelling's T² test; k-sample test; L²-norm based test
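A rough numerical sketch of an L²-norm based statistic in the regime where Hotelling's T² is undefined (p larger than n). Calibration here is by permutation rather than the paper's χ²-type mixture approximation, and the Scheffé reduction from k samples to one is omitted.

import numpy as np

def l2_test(X1, X2, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    stat = np.sum((X1.mean(axis=0) - X2.mean(axis=0)) ** 2)  # squared L2 norm
    pooled, n1 = np.vstack([X1, X2]), len(X1)
    perm_stats = []
    for _ in range(n_perm):                      # permutation null distribution
        perm = rng.permutation(len(pooled))
        A, B = pooled[perm[:n1]], pooled[perm[n1:]]
        perm_stats.append(np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2))
    return stat, np.mean(np.array(perm_stats) >= stat)       # statistic, p-value

rng = np.random.default_rng(7)
X1 = rng.standard_normal((20, 500))              # n = 20 is far below p = 500
X2 = rng.standard_normal((25, 500)) + 0.15       # shifted mean vector
print("T = %.2f, p = %.3f" % l2_test(X1, X2))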
19. Visualizing large-scale high-dimensional data via hierarchical embedding of KNN graphs (Cited by 2)
Authors: Haiyang Zhu, Minfeng Zhu, Yingchaojie Feng, Deng Cai, Yuanzhe Hu, Shilong Wu, Xiangyang Wu, Wei Chen. Visual Informatics (EI), 2021, No. 2, pp. 51-59 (9 pages)
Visualizing the intrinsic structures of high-dimensional data is an essential task in data analysis. Over the past decades, a large number of methods have been proposed. Among all solutions, one promising way to enable effective visual exploration is to construct a k-nearest neighbor (KNN) graph and visualize the graph in a low-dimensional space. Yet, state-of-the-art methods such as LargeVis still suffer from two main problems when applied to large-scale data: (1) they may produce unappealing visualizations due to the non-convexity of the cost function; (2) visualizing the KNN graph is still time-consuming. In this work, we propose a novel visualization algorithm that leverages a multilevel representation to achieve a high-quality graph layout and employs a cluster-based approximation scheme to accelerate the KNN graph layout. Experiments on various large-scale datasets indicate that our approach achieves a speedup by a factor of five for KNN graph visualization compared to LargeVis and yields aesthetically pleasing visualization results.
Keywords: High-dimensional data visualization; KNN graph; Graph visualization
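A minimal sketch of the KNN-graph route to visualization discussed above, built from off-the-shelf pieces; networkx's spring layout stands in for the paper's multilevel, cluster-accelerated layout, and k = 10 is an illustrative choice.

import numpy as np
import networkx as nx
from sklearn.datasets import load_digits
from sklearn.neighbors import kneighbors_graph

X = load_digits().data                            # 1797 points, 64 dimensions
A = kneighbors_graph(X, n_neighbors=10, mode='connectivity')
G = nx.from_scipy_sparse_array(A)                 # KNN graph
pos = nx.spring_layout(G, seed=0)                 # 2-D force-directed embedding
coords = np.array([pos[i] for i in range(len(X))])
print("embedded shape:", coords.shape)            # (1797, 2), ready to plot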
20. Sampling with Prior Knowledge for High-dimensional Gravitational Wave Data Analysis (Cited by 2)
Authors: He Wang, Zhoujian Cao, Yue Zhou, Zong-Kuan Guo, Zhixiang Ren. Big Data Mining and Analytics (EI), 2022, No. 1, pp. 53-63 (11 pages)
Extracting knowledge from high-dimensional data has been notoriously difficult, primarily due to the so-called "curse of dimensionality" and the complex joint distributions of these dimensions. This is a particularly profound issue for high-dimensional gravitational wave data analysis, where one is required to conduct Bayesian inference and estimate joint posterior distributions. In this study, we incorporate prior physical knowledge by sampling from desired interim distributions to develop the training dataset. Accordingly, the more relevant regions of the high-dimensional feature space are covered by additional data points, such that the model can learn the subtle but important details. We adapt the normalizing flow method to be more expressive and trainable, such that the information can be effectively extracted and represented by the transformation between the prior and target distributions. Once trained, our model takes only approximately 1 s on one V100 GPU to generate thousands of samples for probabilistic inference purposes. The evaluation of our approach confirms the efficacy and efficiency of gravitational wave data inference and points to a promising direction for similar research. The source code, specifications, and detailed procedures are publicly accessible on GitHub.
Keywords: High-dimensional data; Prior sampling; Normalizing flow; Gravitational wave