58 articles found
Optimal Estimation of High-Dimensional Covariance Matrices with Missing and Noisy Data
1
Authors: Meiyin Wang, Wanzhou Ye. Advances in Pure Mathematics, 2024, No. 4, pp. 214-227 (14 pages)
The estimation of covariance matrices is very important in many fields, such as statistics. In real applications, data are frequently influenced by high dimensions and noise. However, most relevant studies are based on complete data. This paper studies the optimal estimation of high-dimensional covariance matrices based on missing and noisy samples under the norm. First, the model with sub-Gaussian additive noise is presented. The generalized sample covariance is then modified to define a hard thresholding estimator, and the minimax upper bound is derived. After that, the minimax lower bound is derived, and it is concluded that the estimator presented in this article is rate-optimal. Finally, numerical simulation analysis is performed. The result shows that for missing samples with sub-Gaussian noise, if the true covariance matrix is sparse, the hard thresholding estimator outperforms the traditional estimation method.
Keywords: high-dimensional; covariance matrix; missing data; sub-Gaussian noise; optimal estimation
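The hard thresholding step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes complete, noise-free observations and an ordinary sample covariance, so it is not the paper's generalized estimator for missing and noisy data; the function names, the choice to leave the diagonal untouched, and the threshold value `lam` are all hypothetical.

```python
def sample_covariance(rows):
    """Unbiased sample covariance of complete observations (one row per sample)."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    return [
        [sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (n - 1)
         for j in range(p)]
        for i in range(p)
    ]


def hard_threshold(cov, lam):
    """Set off-diagonal entries with |value| <= lam to zero; keep the diagonal (assumption)."""
    p = len(cov)
    return [
        [cov[i][j] if i == j or abs(cov[i][j]) > lam else 0.0 for j in range(p)]
        for i in range(p)
    ]


# Toy data: variables 0 and 1 are strongly related, variable 2 is nearly unrelated.
rows = [
    [1.0, 2.0, 0.0],
    [2.0, 4.1, 1.0],
    [3.0, 5.9, 0.0],
    [4.0, 8.0, 1.0],
]
thresholded = hard_threshold(sample_covariance(rows), lam=0.5)
```

In this toy example the weak covariance between variables 0 and 2 is zeroed, while the strong covariance between variables 0 and 1 survives, which is the sparsity-inducing behaviour the abstract relies on.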
Observation points classifier ensemble for high-dimensional imbalanced classification (Cited by: 1)
2
Authors: Yulin He, Xu Li, Philippe Fournier-Viger, Joshua Zhexue Huang, Mianjie Li, Salman Salloum. CAAI Transactions on Intelligence Technology (SCIE, EI), 2023, No. 2, pp. 500-517 (18 pages)
In this paper, an Observation Points Classifier Ensemble (OPCE) algorithm is proposed to deal with High-Dimensional Imbalanced Classification (HDIC) problems based on data processed using the Multi-Dimensional Scaling (MDS) feature extraction technique. First, the dimensionality of the original imbalanced data is reduced using MDS so that distances between any two different samples are preserved as well as possible. Second, a novel OPCE algorithm is applied to classify imbalanced samples by placing optimised observation points in a low-dimensional data space. Third, optimization of the observation point mappings is carried out to obtain a reliable assessment of the unknown samples. Exhaustive experiments have been conducted to evaluate the feasibility, rationality, and effectiveness of the proposed OPCE algorithm using seven benchmark HDIC data sets. Experimental results show that (1) the OPCE algorithm can be trained faster on low-dimensional imbalanced data than on high-dimensional data; (2) the OPCE algorithm can correctly identify samples as the number of optimised observation points is increased; and (3) statistical analysis reveals that OPCE yields better HDIC performances on the selected data sets in comparison with eight other HDIC algorithms. This demonstrates that OPCE is a viable algorithm to deal with HDIC problems.
Keywords: classifier ensemble; feature transformation; high-dimensional data classification; imbalanced learning; observation point mechanism
Mapping winter wheat using phenological feature of peak before winter on the North China Plain based on time-series MODIS data (Cited by: 16)
3
Authors: TAO Jian-bin, WU Wen-bin, ZHOU Yong, WANG Yu, JIANG Yan. Journal of Integrative Agriculture (SCIE, CAS, CSCD), 2017, No. 2, pp. 348-359 (12 pages)
By employing the unique phenological feature of winter wheat extracted from the peak before winter (PBW) and the advantages of moderate resolution imaging spectroradiometer (MODIS) data with high temporal resolution and intermediate spatial resolution, a remote sensing-based model for mapping winter wheat on the North China Plain was built through integration with Landsat images and land-use data. First, a phenological window, PBW, was drawn from time-series MODIS data. Next, feature extraction was performed for the PBW to reduce the feature dimension and enhance its information. Finally, a regression model was built to model the relationship between the phenological feature and the sample data. The amount of information of the PBW was evaluated and compared with that of the main peak (MP). The relative precision of the mapping reached up to 92% in comparison to the Landsat sample data, and ranged between 87 and 96% in comparison to the statistical data. These results were sufficient to satisfy the accuracy requirements for winter wheat mapping at a large scale. Moreover, the proposed method can obtain the distribution information for winter wheat in an earlier period than previous studies. This study could throw light on the monitoring of winter wheat in China by using the unique phenological feature of winter wheat.
Keywords: time-series MODIS data; phenological feature; peak before wintering; winter wheat mapping
Clustering Structure Analysis in Time-Series Data With Density-Based Clusterability Measure (Cited by: 6)
4
Authors: Juho Jokinen, Tomi Raty, Timo Lintonen. IEEE/CAA Journal of Automatica Sinica (SCIE, EI, CSCD), 2019, No. 6, pp. 1332-1343 (12 pages)
Clustering is used to gain an intuition of the structures in the data. Most of the current clustering algorithms produce a clustering structure even on data that do not possess such structure. In these cases, the algorithms force a structure in the data instead of discovering one. To avoid false structures in the relations of data, a novel clusterability assessment method called density-based clusterability measure is proposed in this paper. It measures the prominence of clustering structure in the data to evaluate whether a cluster analysis could produce a meaningful insight to the relationships in the data. This is especially useful in time-series data since visualizing the structure in time-series data is hard. The performance of the clusterability measure is evaluated against several synthetic data sets and time-series data sets which illustrate that the density-based clusterability measure can successfully indicate clustering structure of time-series data.
Keywords: clustering; exploratory data analysis; time-series; unsupervised learning
A Length-Adaptive Non-Dominated Sorting Genetic Algorithm for Bi-Objective High-Dimensional Feature Selection
5
Authors: Yanlu Gong, Junhai Zhou, Quanwang Wu, MengChu Zhou, Junhao Wen. IEEE/CAA Journal of Automatica Sinica (SCIE, EI, CSCD), 2023, No. 9, pp. 1834-1844 (11 pages)
As a crucial data preprocessing method in data mining, feature selection (FS) can be regarded as a bi-objective optimization problem that aims to maximize classification accuracy and minimize the number of selected features. Evolutionary computing (EC) is promising for FS owing to its powerful search capability. However, in traditional EC-based methods, feature subsets are represented via a length-fixed individual encoding. It is ineffective for high-dimensional data, because it results in a huge search space and prohibitive training time. This work proposes a length-adaptive non-dominated sorting genetic algorithm (LA-NSGA) with a length-variable individual encoding and a length-adaptive evolution mechanism for bi-objective high-dimensional FS. In LA-NSGA, an initialization method based on correlation and redundancy is devised to initialize individuals of diverse lengths, and a Pareto dominance-based length change operator is introduced to guide individuals to explore in promising search space adaptively. Moreover, a dominance-based local search method is employed for further improvement. The experimental results based on 12 high-dimensional gene datasets show that the Pareto front of feature subsets produced by LA-NSGA is superior to those of existing algorithms.
Keywords: bi-objective optimization; feature selection (FS); genetic algorithm; high-dimensional data; length-adaptive
Similarity measurement method of high-dimensional data based on normalized net lattice subspace (Cited by: 4)
6
Authors: 李文法, Wang Gongming, Li Ke, Huang Su. High Technology Letters (EI, CAS), 2017, No. 2, pp. 179-184 (6 pages)
The performance of conventional similarity measurement methods is affected seriously by the curse of dimensionality of high-dimensional data. The reason is that data difference between sparse and noisy dimensionalities occupies a large proportion of the similarity, leading to the dissimilarities between any results. A similarity measurement method of high-dimensional data based on normalized net lattice subspace is proposed. The data range of each dimension is divided into several intervals, and the components in different dimensions are mapped onto the corresponding interval. Only the component in the same or adjacent interval is used to calculate the similarity. To validate this method, three data types are used, and seven common similarity measurement methods are compared. The experimental result indicates that the relative difference of the method increases with the dimensionality and is approximately two or three orders of magnitude higher than that of the conventional method. In addition, the similarity range of this method in different dimensions is [0,1], which is fit for similarity analysis after dimensionality reduction.
Keywords: high-dimensional data; the curse of dimensionality; similarity; normalization; subspace; NPsim
Spatio-temporal changes of underground coal fires during 2008-2016 in Khanh Hoa coal field (North-east of Viet Nam) using Landsat time-series data (Cited by: 2)
7
Authors: Tuyen Danh VU, Thanh Tien NGUYEN. Journal of Mountain Science (SCIE, CSCD), 2018, No. 12, pp. 2703-2720 (18 pages)
Underground coal fires are one of the most common and serious geohazards in most coal producing countries in the world. Monitoring their spatio-temporal changes plays an important role in controlling and preventing the effects of coal fires, and their environmental impact. In this study, the spatio-temporal changes of underground coal fires in Khanh Hoa coal field (North-East of Viet Nam) were analyzed using Landsat time-series data during the 2008-2016 period. Based on land surface temperatures retrieved from Landsat thermal data, underground coal fires related to thermal anomalies were identified using the MEDIAN+1.5×IQR (IQR: interquartile range) threshold technique. The locations of underground coal fires were validated using a coal fire map produced by the field survey data and cross-validated using the daytime ASTER thermal infrared imagery. Based on the fires extracted from seven Landsat thermal images, the spatio-temporal changes of underground coal fire areas were analyzed. The results showed that the thermal anomalous zones have been correlated with known coal fires. Cross-validation of coal fires using ASTER TIR data showed a high consistency of 79.3%. The largest coal fire area of 184.6 hectares was detected in 2010, followed by 2014 (181.1 hectares) and 2016 (178.5 hectares). The smaller coal fire areas were extracted with areas of 133.6 and 152.5 hectares in 2011 and 2009 respectively. Underground coal fires were mainly detected in the northern and southern part, and tend to spread to the north-west of the coal field.
Keywords: underground coal fires; spatio-temporal changes; Khanh Hoa coal field (Viet Nam); Landsat time-series data
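The MEDIAN+1.5×IQR threshold named in the abstract above can be sketched as follows. This is a minimal illustration on a flat list of temperatures, not the authors' processing chain: the function names are hypothetical, the quartiles use the inclusive interpolation method from Python's `statistics` module (the abstract does not say which quartile definition was used), and treating values strictly above the cutoff as anomalous is an assumption.

```python
import statistics


def anomaly_threshold(temps):
    """Return median + 1.5 * IQR for a list of land surface temperatures."""
    # Quartile definition is an assumption: inclusive linear interpolation.
    q1, _, q3 = statistics.quantiles(temps, n=4, method="inclusive")
    return statistics.median(temps) + 1.5 * (q3 - q1)


def thermal_anomalies(temps):
    """Return the values strictly above the threshold (assumed anomaly rule)."""
    cutoff = anomaly_threshold(temps)
    return [t for t in temps if t > cutoff]


# Toy temperatures in degrees Celsius: eight background pixels and one hot pixel.
temps = [20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 40.0]
cutoff = anomaly_threshold(temps)   # median 24, IQR 26 - 22 = 4
hot = thermal_anomalies(temps)
```

Using the median and IQR rather than the mean and standard deviation keeps the cutoff from being pulled upward by the hot pixels it is meant to detect.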
Classification of Vegetation in North Tibet Plateau Based on MODIS Time-Series Data (Cited by: 1)
8
Authors: LU Yuan, YAN Yan, TAO Heping. Wuhan University Journal of Natural Sciences (CAS), 2008, No. 3, pp. 273-278 (6 pages)
Based on the 16-day composite MODIS (moderate resolution imaging spectroradiometer) NDVI (normalized difference vegetation index) time-series data in 2004, vegetation in North Tibet Plateau was classified and seasonal variations of the pixels selected from different vegetation types were analyzed. The Savitzky-Golay filtering algorithm was applied to filter the MODIS-NDVI time-series data. The processed time-series curves can reflect the real variation trend of vegetation growth. The NDVI time-series curves of coniferous forest, high-cold meadow, high-cold meadow steppe and high-cold steppe all show a single-peak pattern during vegetation growth, with the maximum peak occurring in August. A decision-tree classification model was established according to either NDVI time-series data or land surface temperature data. Then, both classification and processing of vegetation were carried out through the model based on NDVI time-series curves. An accuracy test illustrates that the classification results are of high accuracy and credibility, and that the model is useful for studying climate variation and estimating vegetation production at regional and even global scale.
Keywords: vegetation classification; moderate resolution imaging spectroradiometer; normalized difference vegetation index; time-series data; North Tibet Plateau
Dimensionality Reduction of High-Dimensional Highly Correlated Multivariate Grapevine Dataset
9
Authors: Uday Kant Jha, Peter Bajorski, Ernest Fokoue, Justine Vanden Heuvel, Jan van Aardt, Grant Anderson. Open Journal of Statistics, 2017, No. 4, pp. 702-717 (16 pages)
Viticulturists traditionally have a keen interest in studying the relationship between the biochemistry of grapevines’ leaves/petioles and their associated spectral reflectance in order to understand the fruit ripening rate, water status, nutrient levels, and disease risk. In this paper, we implement imaging spectroscopy (hyperspectral) reflectance data, for the reflective 330 - 2510 nm wavelength region (986 total spectral bands), to assess vineyard nutrient status; this constitutes a high dimensional dataset with a covariance matrix that is ill-conditioned. The identification of the variables (wavelength bands) that contribute useful information for nutrient assessment and prediction plays a pivotal role in multivariate statistical modeling. In recent years, researchers have successfully developed many continuous, nearly unbiased, sparse and accurate variable selection methods to overcome this problem. This paper compares four regularized and one functional regression methods: Elastic Net, Multi-Step Adaptive Elastic Net, Minimax Concave Penalty, iterative Sure Independence Screening, and Functional Data Analysis for wavelength variable selection. Thereafter, the predictive performance of these regularized sparse models is enhanced using stepwise regression. This comparative study of regression methods using a high-dimensional and highly correlated grapevine hyperspectral dataset revealed that Elastic Net for variable selection yields the best predictive ability.
Keywords: high-dimensional data; multi-step adaptive elastic net; minimax concave penalty; sure independence screening; functional data analysis
Making Short-term High-dimensional Data Predictable
10
Authors: CHEN Luonan. Bulletin of the Chinese Academy of Sciences, 2018, No. 4, pp. 243-244 (2 pages)
Making accurate forecasts or predictions is a challenging task in the big data era, in particular for those datasets involving high-dimensional variables but short-term time series points, which are generally available from real-world systems. To address this issue, Prof. ...
Keywords: RDE; making short-term high-dimensional data predictable
Data-driven Surrogate-assisted Method for High-dimensional Multi-area Combined Economic/Emission Dispatch
11
Authors: Chenhao Lin, Huijun Liang, Aokang Pang, Jianwei Zhong, Yongchao Yang. Journal of Modern Power Systems and Clean Energy (SCIE, EI, CSCD), 2024, No. 1, pp. 52-64 (13 pages)
Multi-area combined economic/emission dispatch (MACEED) problems are generally studied using analytical functions. However, as the scale of power systems increases, existing solutions become time-consuming and may not meet operational constraints. To overcome excessive computational expense in high-dimensional MACEED problems, a novel data-driven surrogate-assisted method is proposed. First, a cosine-similarity-based deep belief network combined with a back-propagation (DBN+BP) neural network is utilized to replace cost and emission functions. Second, transfer learning is applied with a pretraining and fine-tuning method to improve DBN+BP regression surrogate models, thus realizing fast construction of surrogate models between different regional power systems. Third, a multi-objective antlion optimizer with a novel general single-dimension retention bi-objective optimization policy is proposed to execute MACEED optimization to obtain scheduling decisions. The proposed method not only ensures the convergence, uniformity, and extensibility of the Pareto front, but also greatly reduces the computational time. Finally, a 4-area 40-unit test system with different constraints is employed to demonstrate the effectiveness of the proposed method.
Keywords: multi-area combined economic/emission dispatch; high-dimensional power system; deep belief network; data driven; transfer learning
Randomized Latent Factor Model for High-dimensional and Sparse Matrices from Industrial Applications (Cited by: 13)
12
Authors: Mingsheng Shang, Xin Luo, Zhigang Liu, Jia Chen, Ye Yuan, MengChu Zhou. IEEE/CAA Journal of Automatica Sinica (EI, CSCD), 2019, No. 1, pp. 131-141 (11 pages)
Latent factor (LF) models are highly effective in extracting useful knowledge from High-Dimensional and Sparse (HiDS) matrices which are commonly seen in various industrial applications. An LF model usually adopts iterative optimizers, which may consume many iterations to achieve a local optimum, resulting in considerable time cost. Hence, determining how to accelerate the training process for LF models has become a significant issue. To address this, this work proposes a randomized latent factor (RLF) model. It incorporates the principle of randomized learning techniques from neural networks into the LF analysis of HiDS matrices, thereby greatly alleviating computational burden. It also extends a standard learning process for randomized neural networks in the context of LF analysis to make the resulting model represent an HiDS matrix correctly. Experimental results on three HiDS matrices from industrial applications demonstrate that compared with state-of-the-art LF models, RLF is able to achieve significantly higher computational efficiency and comparable prediction accuracy for missing data. It provides an important alternative approach to LF analysis of HiDS matrices, which is especially desired for industrial applications demanding highly efficient models.
Keywords: big data; high-dimensional and sparse matrix; latent factor analysis; latent factor model; randomized learning
CSFW-SC: Cuckoo Search Fuzzy-Weighting Algorithm for Subspace Clustering Applying to High-Dimensional Clustering (Cited by: 1)
13
Authors: WANG Jindong, HE Jiajing, ZHANG Hengwei, YU Zhiyong. China Communications (SCIE, CSCD), 2015, No. S2, pp. 55-63 (9 pages)
To address the issue that traditional clustering methods are not suited to high-dimensional data, a cuckoo search fuzzy-weighting algorithm for subspace clustering is presented on the basis of the existing soft subspace clustering algorithm. In the proposed algorithm, a novel objective function is first designed by considering the fuzzy weighting within-cluster compactness and the between-cluster separation, and loosening the constraints of the dimension weight matrix. Then gradual membership and improved cuckoo search, a global search strategy, are introduced to optimize the objective function and search subspace clusters, giving novel learning rules for clustering. Finally, the performance of the proposed algorithm on the clustering analysis of various low and high dimensional datasets is experimentally compared with that of several competitive subspace clustering algorithms. Experimental studies demonstrate that the proposed algorithm can obtain better performance than most of the existing soft subspace clustering algorithms.
Keywords: high-dimensional data; clustering; soft subspace; cuckoo search; fuzzy clustering
Variance Estimation for High-Dimensional Varying Index Coefficient Models
14
Authors: Miao Wang, Hao Lv, Yicun Wang. Open Journal of Statistics, 2019, No. 5, pp. 555-570 (16 pages)
This paper studies the re-adjusted cross-validation method and a semiparametric regression model called the varying index coefficient model. We use the profile spline modal estimator method to estimate the coefficients of the parameter part of the Varying Index Coefficient Model (VICM), while the unknown function part uses the B-spline to expand. Moreover, we combine the above two estimation methods under the assumption of high-dimensional data. The results of data simulation and empirical analysis show that for the varying index coefficient model, the re-adjusted cross-validation method is better in terms of accuracy and stability than traditional methods based on ordinary least squares.
Keywords: high-dimensional data; refitted cross-validation; varying index coefficient models; variance estimation
HXPY: A High-Performance Data Processing Package for Financial Time-Series Data
15
Authors: 郭家栋, 彭靖姝, 苑航, 倪明选. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2023, No. 1, pp. 3-24 (22 pages)
A tremendous amount of data is generated by global financial markets every day, and such time-series data needs to be analyzed in real time to explore its potential value. In recent years, we have witnessed the successful adoption of machine learning models on financial data, where the importance of accuracy and timeliness demands highly effective computing frameworks. However, traditional financial time-series data processing frameworks have shown performance degradation and adaptation issues, such as the outlier handling with stock suspension in Pandas and TA-Lib. In this paper, we propose HXPY, a high-performance data processing package with a C++/Python interface for financial time-series data. HXPY supports miscellaneous acceleration techniques such as the streaming algorithm, the vectorization instruction set, and memory optimization, together with various functions such as time window functions, group operations, down-sampling operations, cross-section operations, row-wise or column-wise operations, shape transformations, and alignment functions. The results of benchmark and incremental analysis demonstrate the superior performance of HXPY compared with its counterparts. On data ranging from MiBs to GiBs, HXPY significantly outperforms other in-memory dataframe computing rivals, by up to hundreds of times.
Keywords: dataframe; time-series data; SIMD (single instruction multiple data); CUDA (Compute Unified Device Architecture)
Outlier Detection Model Based on Autoencoder and Data Augmentation for High-Dimensional Sparse Data
16
Authors: Haitao Zhang, Wenhai Ma, Qilong Han, Zhiqiang Ma. 《国际计算机前沿大会会议论文集》 (EI), 2023, No. 1, pp. 192-206 (15 pages)
This paper aims to address the problems of data imbalance, parameter adjustment complexity, and low accuracy in high-dimensional data anomaly detection. To address these issues, an autoencoder and data augmentation-based anomaly detection model for high-dimensional sparse data is proposed (SEAOD). First, the model solves the problem of imbalanced data by using the weighted SMOTE algorithm and ENN algorithm to fill in the minority class samples and generate a new dataset. Then, an attention mechanism is employed to calculate the feature similarity and determine the structure of the neural network so that the model can learn the data features. Finally, the data are dimensionally reduced based on the autoencoder, and the sparse high-dimensional data are mapped to a low-dimensional space for anomaly detection, overcoming the impact of the curse of dimensionality on detection algorithms. The experimental results show that on 15 public datasets, this model outperforms other comparison algorithms. Furthermore, it was validated on industrial air quality datasets and achieved the expected results with practicality.
Keywords: high-dimensional; data augmentation; attention mechanism; outlier detection
Fusing multi-source data to map spatio-temporal dynamics of winter rape on the Jianghan Plain and Dongting Lake Plain, China (Cited by: 1)
17
Authors: TAO Jian-bin, LIU Wen-bin, TAN Wen-xia, KONG Xiang-bing, XU Meng. Journal of Integrative Agriculture (SCIE, CAS, CSCD), 2019, No. 10, pp. 2393-2407 (15 pages)
Mapping crop distribution with remote sensing data is of great importance for agricultural production, food security and agricultural sustainability. Winter rape is an important oil crop, which plays an important role in the cooking oil market of China. The Jianghan Plain and Dongting Lake Plain (JPDLP) are major agricultural production areas in China. Essential changes in winter rape distribution have taken place in this area during the 21st century. However, the pattern of these changes remains unknown. In this study, the spatial and temporal dynamics of winter rape from 2000 to 2017 on the JPDLP were analyzed. An artificial neural network (ANN)-based classification method was proposed to map fractional winter rape distribution by fusing moderate resolution imaging spectrometer (MODIS) data and high-resolution imagery. The results are as follows: (1) The total winter rape acreage on the JPDLP dropped significantly, especially on the Jianghan Plain, with a decline of about 45% between 2000 and 2017. (2) The winter rape abundance keeps changing, with about 20–30% of croplands changing their abundance drastically in every two consecutive observation years. (3) The trend of change in winter rape shows obvious regional differentiation at the county level, and the decreasing trend was observed more strongly in the traditionally dominant agricultural counties.
Keywords: winter rape; spatio-temporal dynamics; time-series MODIS data; artificial neural network
Effects of subsampling on characteristics of RNA-seq data from triple-negative breast cancer patients
18
Authors: Alexey Stupnikov, Galina V Glazko, Frank Emmert-Streib. Chinese Journal of Cancer (SCIE, CAS, CSCD), 2015, No. 10, pp. 427-438 (12 pages)
Background: Data from RNA-seq experiments provide a wealth of information about the transcriptome of an organism. However, the analysis of such data is very demanding. In this study, we aimed to establish robust analysis procedures that can be used in clinical practice. Methods: We studied RNA-seq data from triple-negative breast cancer patients. Specifically, we investigated the subsampling of RNA-seq data. Results: The main results of our investigations are as follows: (1) the subsampling of RNA-seq data gave biologically realistic simulations of sequencing experiments with smaller sequencing depth but not direct scaling of count matrices; (2) the saturation of results required an average sequencing depth larger than 32 million reads and an individual sequencing depth larger than 46 million reads; and (3) for an abrogated feature selection, higher moments of the distribution of all expressed genes had a higher sensitivity for signal detection than the corresponding mean values. Conclusions: Our results reveal important characteristics of RNA-seq data that must be understood before one can apply such an approach to translational medicine.
Keywords: RNA-seq data; computational genomics; statistical robustness; high-dimensional biology; triple-negative breast cancer
Data-Aware Partitioning Schema in MapReduce
19
Authors: Liang Junjie, Liu Qiongni, Yin Li, Yu Dunhui. 《国际计算机前沿大会会议论文集》, 2015, No. 1, pp. 28-29 (2 pages)
Building on the advantages of the MapReduce programming model in parallel computing and in processing data and tasks on large-scale clusters, a data-aware partitioning schema in MapReduce for large-scale high-dimensional data is proposed. It optimizes the partition method of data blocks with the same contribution to computation in MapReduce. Using a two-stage data partitioning strategy, the data are uniformly distributed into data blocks by clustering and partitioning. The experiments show that the data-aware partitioning schema is very effective and extensible for improving the query efficiency of high-dimensional data.
Keywords: cloud computing; MapReduce; high-dimensional data; data-aware partitioning
Inter-hour direct normal irradiance forecast with multiple data types and time-series (Cited by: 6)
20
Authors: Tingting ZHU, Hai ZHOU, Haikun WEI, Xin ZHAO, Kanjian ZHANG, Jinxia ZHANG. Journal of Modern Power Systems and Clean Energy (SCIE, EI, CSCD), 2019, No. 5, pp. 1319-1327 (9 pages)
Boosted by a strong solar power market, the electricity grid is exposed to risk under an increasing share of fluctuant solar power. To increase the stability of the electricity grid, an accurate solar power forecast is needed to evaluate such fluctuations. In terms of forecast, solar irradiance is the key factor of solar power generation, which is affected by atmospheric conditions, including surface meteorological variables and column integrated variables. These variables involve multiple numerical time-series and images. However, few studies have focused on the processing method of multiple data types in an inter-hour direct normal irradiance (DNI) forecast. In this study, a framework for predicting the DNI for a 10-min time horizon was developed, which included the nondimensionalization of multiple data types and time-series, development of a forecast model, and transformation of the outputs. Several atmospheric variables were considered in the forecast framework, including the historical DNI, wind speed and direction, relative humidity time-series, and ground-based cloud images. Experiments were conducted to evaluate the performance of the forecast framework. The experimental results demonstrate that the proposed method performs well with a normalized mean bias error of 0.41% and a normalized root mean square error (nRMSE) of 20.53%, and outperforms the persistent model with an improvement of 34% in the nRMSE.
Keywords: inter-hour forecast; direct normal irradiance; ground-based cloud images; multiple data types; multiple time-series
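The error measures quoted in the abstract above can be made concrete with a short sketch. The abstract does not state how the errors are normalized or how the improvement is computed, so the choices here are assumptions: both errors are divided by the mean of the observations, the bias is taken as prediction minus observation, and the improvement is one minus the ratio of the two nRMSE values. All function names are hypothetical.

```python
import math


def nmbe(observed, predicted):
    """Normalized mean bias error: mean(predicted - observed) / mean(observed)."""
    mean_obs = sum(observed) / len(observed)
    bias = sum(p - o for o, p in zip(observed, predicted)) / len(observed)
    return bias / mean_obs


def nrmse(observed, predicted):
    """Normalized RMSE: sqrt(mean squared error) / mean(observed)."""
    mean_obs = sum(observed) / len(observed)
    mse = sum((p - o) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return math.sqrt(mse) / mean_obs


def improvement(nrmse_model, nrmse_reference):
    """Relative nRMSE reduction of a model against a reference forecast."""
    return 1.0 - nrmse_model / nrmse_reference


# Toy DNI values in W/m^2 (made up for illustration, not from the paper).
observed = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 310.0]
model_nrmse = nrmse(observed, predicted)   # RMSE 10 over mean 200
model_nmbe = nmbe(observed, predicted)
```

A small bias alongside a much larger nRMSE, as reported in the abstract, indicates that the forecast errors mostly cancel on average while individual forecasts still deviate noticeably.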