Funding: Outstanding Youth Foundation of Hunan Provincial Department of Education (Grant No. 22B0911).
Abstract: In this paper, we introduce the censored composite conditional quantile coefficient (cCCQC) to rank the relative importance of each predictor in high-dimensional censored regression. The cCCQC takes advantage of all useful information across quantiles and can effectively detect nonlinear effects, including interactions and heterogeneity. Furthermore, the proposed screening method based on the cCCQC is robust to outliers and enjoys the sure screening property. Simulation results demonstrate that the proposed method performs competitively on survival datasets with high-dimensional predictors, particularly when the variables are highly correlated.
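As a rough, uncensored illustration of composite quantile screening (not the paper's cCCQC itself), the sketch below ranks predictors by summing squared correlations between each feature and the quantile score function over a small grid of quantile levels. The function name, quantile grid, and synthetic data are assumptions for illustration, and the censoring adjustment is omitted.

```python
import numpy as np

def composite_qc(X, y, taus=(0.25, 0.5, 0.75)):
    """Rank features by a composite of squared quantile correlations."""
    _, p = X.shape
    score = np.zeros(p)
    for tau in taus:
        q = np.quantile(y, tau)
        psi = tau - (y < q).astype(float)   # quantile score function at level tau
        for j in range(p):
            score[j] += np.corrcoef(psi, X[:, j])[0, 1] ** 2
    return score                             # larger score = more important

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)   # two active predictors
print(np.argsort(composite_qc(X, y))[::-1][:5])      # top-5 ranked features
```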
Abstract: The estimation of covariance matrices is very important in many fields, such as statistics. In real applications, data are frequently influenced by high dimensions and noise. However, most relevant studies are based on complete data. This paper studies the optimal estimation of high-dimensional covariance matrices based on missing and noisy samples under the norm. First, the model with sub-Gaussian additive noise is presented. The generalized sample covariance is then modified to define a hard thresholding estimator, and the minimax upper bound is derived. After that, the minimax lower bound is derived, and it is concluded that the estimator presented in this article is rate-optimal. Finally, numerical simulation analysis is performed. The results show that for missing samples with sub-Gaussian noise, if the true covariance matrix is sparse, the hard thresholding estimator outperforms the traditional estimation method.
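A minimal sketch of the estimator's two ingredients, assuming a plain pairwise-complete sample covariance in place of the paper's noise-corrected generalized sample covariance: compute covariances from the observed entries, then hard-threshold the off-diagonal entries. The threshold scale sqrt(log p / n) is the usual one for sparse covariance estimation, chosen here for illustration.

```python
import numpy as np

def hard_threshold_cov(X, lam):
    """Pairwise-complete sample covariance of X (NaN = missing),
    followed by entrywise hard thresholding at level lam."""
    p = X.shape[1]
    S = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            m = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])   # jointly observed rows
            xi, xj = X[m, i], X[m, j]
            S[i, j] = np.mean((xi - xi.mean()) * (xj - xj.mean()))
    T = np.where(np.abs(S) >= lam, S, 0.0)   # kill small entries
    np.fill_diagonal(T, np.diag(S))          # variances are never thresholded
    return T

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
X[rng.random(X.shape) < 0.2] = np.nan        # 20% missing at random
n_min = int(0.8 ** 2 * 500)                  # rough minimal pairwise sample size
lam = np.sqrt(np.log(20) / n_min)            # threshold on the usual scale
print(np.count_nonzero(hard_threshold_cov(X, lam)))
```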
Funding: National Natural Science Foundation of China (Grant No. 61972261); Basic Research Foundations of Shenzhen (Grant Nos. JCYJ20210324093609026, JCYJ20200813091134001).
Abstract: In this paper, an Observation Points Classifier Ensemble (OPCE) algorithm is proposed to deal with High-Dimensional Imbalanced Classification (HDIC) problems, based on data processed using the Multi-Dimensional Scaling (MDS) feature extraction technique. First, the dimensionality of the original imbalanced data is reduced using MDS so that distances between any two different samples are preserved as well as possible. Second, a novel OPCE algorithm is applied to classify imbalanced samples by placing optimised observation points in a low-dimensional data space. Third, optimisation of the observation point mappings is carried out to obtain a reliable assessment of the unknown samples. Exhaustive experiments have been conducted to evaluate the feasibility, rationality, and effectiveness of the proposed OPCE algorithm using seven benchmark HDIC data sets. Experimental results show that (1) the OPCE algorithm can be trained faster on low-dimensional imbalanced data than on high-dimensional data; (2) the OPCE algorithm identifies samples more accurately as the number of optimised observation points increases; and (3) statistical analysis reveals that OPCE yields better HDIC performance on the selected data sets in comparison with eight other HDIC algorithms. This demonstrates that OPCE is a viable algorithm for HDIC problems.
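The MDS preprocessing step is standard and easy to reproduce with scikit-learn; the sketch below shows only this distance-preserving reduction (the OPCE classifier itself is not reproduced), with arbitrary sizes and a five-dimensional target space as assumptions.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 300))                   # high-dimensional samples
Z = MDS(n_components=5, random_state=1).fit_transform(X)
print(Z.shape)                                    # (150, 5) distance-preserving embedding
```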
Funding: Supported in part by the National Natural Science Foundation of China (Grant Nos. 62172065, 62072060).
Abstract: As a crucial data preprocessing method in data mining, feature selection (FS) can be regarded as a bi-objective optimization problem that aims to maximize classification accuracy and minimize the number of selected features. Evolutionary computing (EC) is promising for FS owing to its powerful search capability. However, in traditional EC-based methods, feature subsets are represented via a length-fixed individual encoding. This is ineffective for high-dimensional data, because it results in a huge search space and prohibitive training time. This work proposes a length-adaptive non-dominated sorting genetic algorithm (LA-NSGA) with a length-variable individual encoding and a length-adaptive evolution mechanism for bi-objective high-dimensional FS. In LA-NSGA, an initialization method based on correlation and redundancy is devised to initialize individuals of diverse lengths, and a Pareto dominance-based length change operator is introduced to guide individuals to explore promising search space adaptively. Moreover, a dominance-based local search method is employed for further improvement. Experimental results based on 12 high-dimensional gene datasets show that the Pareto front of feature subsets produced by LA-NSGA is superior to those of existing algorithms.
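The bi-objective formulation can be made concrete with a small sketch: each candidate is a variable-length feature-index subset scored by (error rate, subset size), and Pareto dominance decides which candidates survive. A KNN scorer and random subsets below stand in for LA-NSGA's evolved population; all names and sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=100, n_informative=10,
                           random_state=0)

def objectives(subset):
    """(error rate, subset size) -- both objectives are minimised."""
    acc = cross_val_score(KNeighborsClassifier(), X[:, subset], y, cv=3).mean()
    return (1.0 - acc, len(subset))

def dominates(a, b):
    return all(u <= v for u, v in zip(a, b)) and a != b

rng = np.random.default_rng(0)
pop = [sorted(rng.choice(100, size=k, replace=False)) for k in (5, 10, 20, 40)]
scores = [objectives(s) for s in pop]
front = [s for s, f in zip(pop, scores)
         if not any(dominates(g, f) for g in scores)]
print(len(front), "non-dominated subsets of lengths", [len(s) for s in front])
```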
Funding: Supported by the National Natural Science Foundation of China (No. 61502475) and the Importation and Development of High-Caliber Talents Project of the Beijing Municipal Institutions (No. CIT&TCD201504039).
Abstract: The performance of conventional similarity measurement methods is seriously affected by the curse of dimensionality of high-dimensional data. The reason is that the data difference between sparse and noisy dimensions occupies a large proportion of the similarity, making any two samples appear dissimilar. A similarity measurement method for high-dimensional data based on a normalized net lattice subspace is proposed. The data range of each dimension is divided into several intervals, and the components in different dimensions are mapped onto the corresponding intervals. Only components in the same or adjacent intervals are used to calculate the similarity. To validate this method, three data types are used and seven common similarity measurement methods are compared. The experimental results indicate that the relative difference of the method increases with the dimensionality and is approximately two or three orders of magnitude higher than that of the conventional methods. In addition, the similarity range of this method in different dimensions is [0, 1], which is suitable for similarity analysis after dimensionality reduction.
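A minimal sketch of the net-lattice idea, under the assumption that each retained component contributes a normalized closeness score: each dimension's range is split into k intervals, and only components falling in the same or an adjacent interval contribute to the similarity. The weighting and normalisation here are simplified relative to the paper.

```python
import numpy as np

def lattice_similarity(a, b, lo, hi, k=10):
    """Similarity counting only components in the same or adjacent interval."""
    ia = np.floor((a - lo) / (hi - lo) * k).clip(0, k - 1)   # interval index of a
    ib = np.floor((b - lo) / (hi - lo) * k).clip(0, k - 1)   # interval index of b
    near = np.abs(ia - ib) <= 1                  # same or adjacent interval only
    if not near.any():
        return 0.0
    closeness = 1.0 - np.abs(a[near] - b[near]) / (hi[near] - lo[near])
    return float(closeness.sum() / len(a))       # normalised into [0, 1]

rng = np.random.default_rng(2)
X = rng.uniform(size=(100, 64))                  # 64-dimensional data
lo, hi = X.min(axis=0), X.max(axis=0)
print(lattice_similarity(X[0], X[1], lo, hi))
```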
Abstract: In Brazil and various regions globally, the initiation of landslides is frequently associated with rainfall; yet the spatial arrangement of geological structures and stratification considerably influences landslide occurrences. The multifaceted nature of these influences makes the surveillance of mass movements a highly intricate task, requiring an understanding of numerous interdependent variables. Recent years have seen an emergence of scholarly research aimed at integrating geophysical and geotechnical methodologies. The conjoint examination of geophysical and geotechnical data offers an enhanced perspective on subsurface structures. Within this work, a methodology is proposed for the synchronous analysis of electrical resistivity geophysical data and geotechnical data, specifically those extracted from the Light Dynamic Penetrometer (DPL) and Standard Penetration Test (SPT). This study involved a linear fitting process to correlate resistivity with N10/SPT N-values from DPL/SPT soundings, culminating in a 2D profile of N10/SPT N-values predicated on electrical profiles. The findings of this research furnish invaluable insights into slope stability by allowing for a two-dimensional representation of penetration resistance properties. Through the synthesis of geophysical and geotechnical data, this project aims to augment the comprehension of subsurface conditions, with potential implications for refining landslide risk evaluations. This endeavor offers insight into the formulation of more effective and precise slope management protocols and disaster prevention strategies.
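The resistivity-to-N10 correlation step is an ordinary least-squares line fit; the toy sketch below uses made-up numbers (not survey data) purely to show the shape of the computation.

```python
import numpy as np

# Hypothetical values, not field measurements:
resistivity = np.array([120., 180., 260., 340., 430.])   # ohm*m
n10 = np.array([4., 7., 11., 14., 19.])                   # DPL blow counts

slope, intercept = np.polyfit(resistivity, n10, deg=1)    # linear fit
predicted = slope * resistivity + intercept               # values for a 2D profile
print(f"N10 ~ {slope:.3f} * rho + {intercept:.2f}")
```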
Funding: Supported by the National Natural Science Foundation of China (No. 61300078); the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (No. CIT&TCD201504039); the Funding Project for Academic Human Resources Development in Beijing Union University (Nos. BPHR2014A03, Rk100201510); and the "New Start" Academic Research Projects of Beijing Union University (No. Hzk10201501).
Abstract: Problems exist in similarity measurement and index tree construction, which affect the performance of nearest neighbor search for high-dimensional data. The equidistance problem is solved by using the NPsim function to calculate similarity, and a sequential NPsim matrix is built to improve indexing performance. Combining these innovations, a nearest neighbor search algorithm for high-dimensional data based on the sequential NPsim matrix is proposed and compared with nearest neighbor search algorithms based on the KD-tree and SR-tree on the Munsell spectral data set. Experimental results show that the similarity achieved by the proposed algorithm is better than that of the other algorithms, and its search speed is thousands of times faster. In addition, the slow construction of the sequential NPsim matrix can be accelerated by parallel computing.
Abstract: Viticulturists traditionally have a keen interest in studying the relationship between the biochemistry of grapevines' leaves/petioles and their associated spectral reflectance in order to understand fruit ripening rate, water status, nutrient levels, and disease risk. In this paper, we implement imaging spectroscopy (hyperspectral) reflectance data for the reflective 330-2510 nm wavelength region (986 total spectral bands) to assess vineyard nutrient status; this constitutes a high-dimensional dataset with a covariance matrix that is ill-conditioned. The identification of the variables (wavelength bands) that contribute useful information for nutrient assessment and prediction plays a pivotal role in multivariate statistical modeling. In recent years, researchers have successfully developed many continuous, nearly unbiased, sparse, and accurate variable selection methods to overcome this problem. This paper compares four regularized regression methods and one functional regression method for wavelength variable selection: Elastic Net, Multi-Step Adaptive Elastic Net, Minimax Concave Penalty, iterative Sure Independence Screening, and Functional Data Analysis. Thereafter, the predictive performance of these regularized sparse models is enhanced using stepwise regression. This comparative study of regression methods using a high-dimensional and highly correlated grapevine hyperspectral dataset revealed that Elastic Net variable selection yields the best predictive ability.
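Elastic Net wavelength selection reduces to fitting a cross-validated Elastic Net and reading off the nonzero coefficients. The sketch below uses synthetic spectra with 986 bands; the l1_ratio grid and the planted informative bands are illustrative assumptions, not the vineyard data.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 986))             # 120 samples x 986 spectral bands
beta = np.zeros(986)
beta[[50, 400, 700]] = [2.0, -1.5, 1.0]     # three informative bands (assumed)
y = X @ beta + rng.normal(scale=0.5, size=120)

model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)      # retained wavelength bands
print(len(selected), "bands selected")
```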
Funding: Supported by grants from the Chinese Academy of Sciences (CAS), the National Key R&D Program of China, and the National Natural Science Foundation of China.
Abstract: Making accurate forecasts or predictions is a challenging task in the big data era, in particular for datasets involving high-dimensional variables but short-term time series points, which are generally available from real-world systems. To address this issue, Prof. ...
Funding: Projects (41572317, 51374242) supported by the National Natural Science Foundation of China; Project (2015CX005) supported by the Innovation-Driven Plan of Central South University, China.
Abstract: Data organization requires high efficiency for the large amounts of data used in a digital mine system. A new method of storing massive block model data is proposed to meet the required database characteristics, including ACID compliance, concurrency support, data sharing, and efficient access. Each block model is organized by a linear octree and stored in LMDB (Lightning Memory-Mapped Database). Geological attributes can be queried at any point in 3D space using a comparison algorithm on location codes and a conversion algorithm from geometric-space address codes to storage location codes. The performance and robustness of querying geological attributes over a 3D spatial region are greatly enhanced by a 3D-to-2D transformation and a 2D grid-scanning method that screens inner and outer points. Experimental results showed that this method can access massive block model data while meeting the database characteristics. The method with LMDB is at least three times faster than that with etree, especially for reads. In addition, the larger the amount of data processed, the more efficient the method becomes.
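A linear octree addresses each block by a one-dimensional location code. A common choice (assumed here, since the paper's exact encoding is not given) is the Morton code obtained by interleaving the bits of the 3D block indices; the resulting bytes can serve directly as keys in an LMDB-style key-value store.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one location code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)       # x bits at positions 0, 3, 6, ...
        code |= ((y >> i) & 1) << (3 * i + 1)   # y bits at positions 1, 4, 7, ...
        code |= ((z >> i) & 1) << (3 * i + 2)   # z bits at positions 2, 5, 8, ...
    return code

key = morton3d(5, 9, 2).to_bytes(4, "big")      # byte key for the KV store
print(key.hex())
```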
Abstract: Multi-area combined economic/emission dispatch (MACEED) problems are generally studied using analytical functions. However, as the scale of power systems increases, existing solutions become time-consuming and may not meet operational constraints. To overcome the excessive computational expense of high-dimensional MACEED problems, a novel data-driven surrogate-assisted method is proposed. First, a cosine-similarity-based deep belief network combined with a back-propagation (DBN+BP) neural network is utilized to replace the cost and emission functions. Second, transfer learning is applied with a pretraining and fine-tuning method to improve the DBN+BP regression surrogate models, thus realizing fast construction of surrogate models for different regional power systems. Third, a multi-objective antlion optimizer with a novel general single-dimension retention bi-objective optimization policy is proposed to execute the MACEED optimization and obtain scheduling decisions. The proposed method not only ensures the convergence, uniformity, and extensibility of the Pareto front, but also greatly reduces the computational time. Finally, a 4-area 40-unit test system with different constraints is employed to demonstrate the effectiveness of the proposed method.
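To illustrate the surrogate idea only, the sketch below trains a small MLP regressor (standing in for the paper's cosine-similarity-based DBN+BP network) on a textbook quadratic fuel-cost function, so that an optimizer could query the learned model instead of the analytical function. All constants and sizes are made up.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fuel_cost(p):
    """Textbook quadratic unit cost a + b*P + c*P^2 (made-up constants)."""
    a, b, c = 10.0, 2.0, 0.01
    return a + b * p + c * p ** 2

rng = np.random.default_rng(5)
P = rng.uniform(10, 100, size=(2000, 1))              # sampled unit outputs (MW)
surrogate = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000,
                         random_state=0).fit(P / 100.0, fuel_cost(P).ravel())
# The optimiser queries the cheap surrogate instead of the exact function:
print(surrogate.predict([[0.55]]), fuel_cost(55.0))   # scaled input for P = 55 MW
```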
Funding: Supported in part by the National Natural Science Foundation of China (Grant Nos. 61772493, 91646114); the Chongqing Research Program of Technology Innovation and Application (cstc2017rgzn-zdyfX0020); and in part by the Pioneer Hundred Talents Program of the Chinese Academy of Sciences.
Abstract: Latent factor (LF) models are highly effective in extracting useful knowledge from High-Dimensional and Sparse (HiDS) matrices, which are commonly seen in various industrial applications. An LF model usually adopts iterative optimizers, which may consume many iterations to achieve a local optimum, resulting in considerable time cost. Hence, determining how to accelerate the training process for LF models has become a significant issue. To address this, this work proposes a randomized latent factor (RLF) model. It incorporates the principle of randomized learning techniques from neural networks into the LF analysis of HiDS matrices, thereby greatly alleviating the computational burden. It also extends a standard learning process for randomized neural networks to the context of LF analysis, to make the resulting model represent an HiDS matrix correctly. Experimental results on three HiDS matrices from industrial applications demonstrate that, compared with state-of-the-art LF models, RLF is able to achieve significantly higher computational efficiency and comparable prediction accuracy for missing data. It provides an important alternative approach to LF analysis of HiDS matrices, which is especially desired for industrial applications demanding highly efficient models.
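A simplified reading of the randomized-learning principle, not the paper's exact RLF model: draw one factor matrix at random, freeze it, and solve the other factor matrix in closed form on the observed entries only, which removes the iterative optimization entirely. The ridge term lam and all sizes below are illustrative assumptions.

```python
import numpy as np

def rlf_sketch(R, mask, k=10, lam=0.1, seed=0):
    """Random frozen factors P; ridge least squares for Q on observed entries."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    P = rng.normal(scale=1 / np.sqrt(k), size=(m, k))   # random, never trained
    Q = np.zeros((n, k))
    for j in range(n):                                   # closed form per column
        rows = np.flatnonzero(mask[:, j])
        A = P[rows]
        Q[j] = np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ R[rows, j])
    return P, Q

rng = np.random.default_rng(1)
R = rng.normal(size=(80, 60))
mask = rng.random((80, 60)) < 0.3                        # 30% of entries observed
P, Q = rlf_sketch(R, mask)
print(((P @ Q.T)[mask] - R[mask]).std())                 # fit on observed entries
```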
Funding: Funded by the National Natural Science Foundation of China (Grant No. 41941019) and the State Key Laboratory of Hydroscience and Engineering (Grant No. 2019-KY-03).
Abstract: Real-time prediction of the rock mass class in front of the tunnel face is essential for the adaptive adjustment of tunnel boring machines (TBMs). During the TBM tunnelling process, a large number of operation data are generated, reflecting the interaction between the TBM system and the surrounding rock, and these data can be used to evaluate the rock mass quality. This study proposed a stacking ensemble classifier for the real-time prediction of rock mass classification using TBM operation data. Based on the Songhua River water conveyance project, a total of 7538 TBM tunnelling cycles and the corresponding rock mass classes were obtained after data preprocessing. Then, through a tree-based feature selection method, 10 key TBM operation parameters were selected, and the mean values of the 10 selected features in the stable phase, after removing outliers, were calculated as the inputs of the classifiers. The preprocessed data were randomly divided into a training set (90%) and a test set (10%) using simple random sampling. Besides the stacking ensemble classifier, seven individual classifiers were established for comparison: support vector machine (SVM), k-nearest neighbors (KNN), random forest (RF), gradient boosting decision tree (GBDT), decision tree (DT), logistic regression (LR), and multilayer perceptron (MLP), where the hyper-parameters of each classifier were optimised using the grid search method. The prediction results show that the stacking ensemble classifier has better performance than the individual classifiers, with a more powerful learning and generalisation ability for small and imbalanced samples. Additionally, a relatively balanced training set was obtained by the synthetic minority oversampling technique (SMOTE), and the influence of sample imbalance on the prediction performance is discussed.
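The stacking setup maps directly onto scikit-learn's StackingClassifier; the sketch below wires up three of the paper's base learners with a logistic-regression meta-learner on synthetic stand-in data with 10 features (the real inputs would be the selected TBM parameters).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, n_classes=3,
                           n_informative=6, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.1, random_state=0)

stack = StackingClassifier(
    estimators=[("svm", SVC(probability=True)),
                ("knn", KNeighborsClassifier()),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000))   # meta-learner
print(stack.fit(Xtr, ytr).score(Xte, yte))               # test-set accuracy
```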
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 41772309 and 51908431) and the Outstanding Youth Foundation of Hubei Province, China (Grant No. 2019CFA074).
Abstract: Real-time perception of rock mass information is of great importance to efficient tunneling and hazard prevention with tunnel boring machines (TBMs). In this study, a TBM-rock mutual feedback perception method based on data mining (DM) is proposed, which takes 10 tunneling parameters related to surrounding rock conditions as input features. For implementation, first, a database of TBM tunneling parameters was established, in which 10,807 tunneling cycles from the Songhua River water conveyance tunnel were accommodated. Then, the spectral clustering (SC) algorithm based on graph theory was introduced to cluster the TBM tunneling data. According to the clustering results and a rock mass boreability index, the rock mass conditions were classified into four classes, and reasonable distribution intervals of the main tunneling parameters corresponding to each class were presented. Meanwhile, based on a deep neural network (DNN), a real-time prediction model for the different rock conditions was established. Finally, the rationality and adaptability of the proposed method were validated by analyzing the tunneling specific energy, feature importance, and training dataset size. The proposed TBM-rock mutual feedback perception method enables the automatic identification of rock mass conditions and the dynamic adjustment of tunneling parameters during TBM driving. Furthermore, in terms of prediction performance, the method can predict the rock mass conditions ahead of the tunnel face in real time more accurately than traditional machine learning prediction methods.
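The clustering step can be sketched with scikit-learn's SpectralClustering, grouping standardized tunneling-parameter vectors into four rock-condition classes; the blob data below is a synthetic stand-in for the TBM cycles, and the rbf affinity is an assumed choice.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=c, size=(50, 10)) for c in (0, 3, 6, 9)])  # 4 groups
Z = StandardScaler().fit_transform(X)            # standardize tunneling features
labels = SpectralClustering(n_clusters=4, affinity="rbf",
                            random_state=0).fit_predict(Z)
print(np.bincount(labels))                       # cluster sizes, ideally 4 x 50
```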
Funding: Supported in part by the National Natural Science Foundation of China (Nos. 61303074, 61309013); the National Key Basic Research and Development Program ("973") of China (No. 2012CB315900); and the Programs for Science and Technology Development of Henan Province (Nos. 12210231003, 13210231002).
Abstract: To address the issue that traditional clustering methods are not appropriate for high-dimensional data, a cuckoo search fuzzy-weighting algorithm for subspace clustering is presented on the basis of existing soft subspace clustering algorithms. In the proposed algorithm, a novel objective function is first designed by considering the fuzzy-weighted within-cluster compactness and the between-cluster separation, and by loosening the constraints on the dimension weight matrix. Then gradual membership and an improved cuckoo search, a global search strategy, are introduced to optimize the objective function and search for subspace clusters, giving novel learning rules for clustering. Finally, the performance of the proposed algorithm on the clustering analysis of various low- and high-dimensional datasets is experimentally compared with that of several competitive subspace clustering algorithms. Experimental studies demonstrate that the proposed algorithm can obtain better performance than most of the existing soft subspace clustering algorithms.
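The global search strategy rests on Lévy flights; a minimal sketch of the standard cuckoo-search move, using Mantegna's algorithm for the Lévy-distributed step sizes, is given below. The step scale 0.01 and the notion of a "best nest" follow the generic cuckoo search, not necessarily the paper's fuzzy-weighting variant.

```python
import numpy as np
from math import gamma

def levy_step(dim, beta=1.5, rng=None):
    """Mantegna's algorithm for Levy-distributed step sizes."""
    rng = rng or np.random.default_rng()
    sigma = (gamma(1 + beta) * np.sin(np.pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(scale=sigma, size=dim)
    v = rng.normal(size=dim)
    return u / np.abs(v) ** (1 / beta)

rng = np.random.default_rng(0)
nest, best = rng.random(5), rng.random(5)                 # current and best nests
new_nest = nest + 0.01 * levy_step(5, rng=rng) * (nest - best)  # Levy flight move
print(new_nest)
```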
Abstract: A single- and dual-parameter data acquisition, ion beam measurement, and control system for accelerator mass spectrometry is described. The system hardware has been constructed with the advantages of lower cost and higher reliability. It provides a variety of functions, such as selecting the acquisition mode, carrying out multiple displays, analyzing data, and especially viewing isometric spectra from different directions. It can also be used in ordinary nuclear spectrum systems.
Funding: National Key R&D Program of China (No. 2016YFF0203702); Quality Inspection Public Welfare Scientific Research Foundation of China (No. 201310062); Science Research Program of the General Administration of Quality Supervision, Inspection and Quarantine of China (No. 2011IK037).
Abstract: Liquid chromatography coupled to high-resolution mass spectrometry (LC-HRMS) was used to establish a high-throughput and sensitive method for 18 phthalic acid esters (PAEs) in textiles. The method was applied to screen PAEs in 413 textile samples, and 7434 pieces of test data were analyzed. Specific compounds, textile materials, the distribution of detected values, and the positive rate of samples are further discussed. Twenty-nine percent of the samples were found to contain PAEs, and no samples exceeded regulatory limits. Diisobutyl phthalate (DiBP), dibutyl phthalate (DBP), and di-2-ethylhexyl phthalate (DEHP) were detected with high frequency, and most of these samples were coated or printed fabrics. This study reflects the distribution of phthalate residues in textiles on the market and provides data support for the risk assessment of PAEs for textile producers and consumers.
Abstract: This paper studies the re-adjusted cross-validation method and a semiparametric regression model called the varying index coefficient model (VICM). We use the profile spline modal estimator method to estimate the coefficients of the parametric part of the VICM, while the unknown function part is expanded using B-splines. Moreover, we combine these two estimation methods under the assumption of high-dimensional data. The results of data simulation and empirical analysis show that, for the varying index coefficient model, the re-adjusted cross-validation method is better in terms of accuracy and stability than traditional methods based on ordinary least squares.
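The B-spline expansion of the unknown function part amounts to building a spline design matrix and fitting coefficients by least squares. The sketch below (requiring SciPy >= 1.8 for BSpline.design_matrix) uses an assumed cubic basis with a clamped knot vector on [0, 1]; knot count and test function are illustrative.

```python
import numpy as np
from scipy.interpolate import BSpline

k = 3                                             # cubic B-splines
interior = np.linspace(0, 1, 8)
t = np.r_[[0] * k, interior, [1] * k]             # clamped knot vector
x = np.linspace(0, 1, 100)
B = BSpline.design_matrix(x, t, k).toarray()      # 100 x 10 basis design matrix
coef, *_ = np.linalg.lstsq(B, np.sin(2 * np.pi * x), rcond=None)
print(B.shape, np.abs(B @ coef - np.sin(2 * np.pi * x)).max())  # fit error
```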
Abstract: In this paper, a Distributed In-Memory Database (DIMDB) system is proposed to improve processing efficiency in mass data applications. The system uses an enhanced language similar to Structured Query Language (SQL) with a key-value storage schema. The design goals of the DIMDB system are described and its system architecture is discussed. The operation flow and the enhanced SQL-like language are also discussed, and experimental results are used to test the validity of the system.
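One plausible key-value layout for such a schema (an assumption here, since the paper's actual encoding is not given) stores each row's JSON body under a "table:primary-key" string key, which is enough to emulate simple point queries of an SQL-like language.

```python
import json

def put_row(store, table, pk, row):
    """Store one row's body under a composite table:pk key (assumed layout)."""
    store[f"{table}:{pk}"] = json.dumps(row)

def get_row(store, table, pk):
    """Point lookup, analogous to SELECT * FROM table WHERE id = pk."""
    return json.loads(store[f"{table}:{pk}"])

store = {}                                          # in-memory stand-in for the KV engine
put_row(store, "users", 42, {"name": "alice", "age": 30})
print(get_row(store, "users", 42))
```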
Funding: This research was supported by the Chung-Ang University Research Scholarship Grants in 2017.
Abstract: Painting is done according to the artist's style. The most representative elements of the style are the texture and shape of the brush strokes. Computer simulations allow an artist's painting to be reproduced by capturing such strokes and pasting them onto an image; this is called stroke-based rendering. The quality of the result depends on the number and quality of the strokes, since captured strokes are used to create the image. It is not easy to render using a large amount of information, as there is a limit to how many strokes can be scanned. In this work, we produce rendering results using mass data, generating large numbers of strokes by expanding existing strokes through warping. Through this, we produce results of higher quality than those of conventional studies. Finally, we also examine the correlation between the amount of data and the quality of the results.