High-dimensional and incomplete (HDI) matrices arise in all kinds of big-data-related practical applications. A latent factor analysis (LFA) model can perform efficient representation learning on an HDI matrix, and its hyper-parameter adaptation can be implemented through a particle swarm optimizer (PSO) to meet scalability requirements. However, a conventional PSO suffers from premature convergence, which causes accuracy loss in the resultant LFA model. To address this thorny issue, this study merges the information of each particle's state migration into its evolution process, following the principle of a generalized momentum method, to improve its search ability, thereby building a state-migration particle swarm optimizer (SPSO), whose theoretical convergence is rigorously proved in this study. SPSO is then incorporated into an LFA model to implement efficient hyper-parameter adaptation without accuracy loss. Experiments on six HDI matrices indicate that an SPSO-incorporated LFA model outperforms state-of-the-art LFA models in prediction accuracy for the missing data of an HDI matrix, with competitive computational efficiency. Hence, SPSO ensures efficient and reliable hyper-parameter adaptation in an LFA model, and thus practical, accurate representation learning for HDI matrices.
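The abstract does not spell out SPSO's exact update rule. As a minimal sketch of the general idea (folding a momentum-style term built from each particle's state change into the standard PSO velocity update), the snippet below adds a hypothetical `gamma`-weighted term based on each particle's most recent personal-best migration and evaluates it on a toy sphere function; it is an illustration, not the authors' SPSO.

```python
import numpy as np

def sphere(x):
    """Toy objective: minimize the squared Euclidean norm."""
    return float(np.sum(x ** 2))

def state_migration_pso(f, dim=5, n_particles=20, iters=200,
                        w=0.7, c1=1.5, c2=1.5, gamma=0.3, seed=0):
    """Standard PSO plus a momentum-style term built from each particle's most
    recent personal-best migration. gamma is a hypothetical coefficient; the
    published SPSO update rule may differ."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))       # positions
    v = np.zeros_like(x)                             # velocities
    migration = np.zeros_like(x)                     # last personal-best shift
    pbest = x.copy()
    pbest_f = np.array([f(p) for p in x])
    gbest = pbest[np.argmin(pbest_f)].copy()
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = (w * v
             + c1 * r1 * (pbest - x)
             + c2 * r2 * (gbest - x)
             + gamma * migration)                    # state-migration term
        x = x + v
        fx = np.array([f(p) for p in x])
        better = fx < pbest_f
        migration[better] = x[better] - pbest[better]
        pbest[better], pbest_f[better] = x[better], fx[better]
        gbest = pbest[np.argmin(pbest_f)].copy()
    return gbest, float(f(gbest))

best_x, best_f = state_migration_pso(sphere)
print(best_f)   # should be close to 0 on this toy problem
```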
In this paper, we introduce the censored composite conditional quantile coefficient (cCCQC) to rank the relative importance of each predictor in high-dimensional censored regression. The cCCQC takes advantage of all useful information across quantiles and can effectively detect nonlinear effects, including interactions and heterogeneity. Furthermore, the proposed screening method based on cCCQC is robust to the existence of outliers and enjoys the sure screening property. Simulation results demonstrate that the proposed method performs competitively on survival datasets with high-dimensional predictors, particularly when the variables are highly correlated.
The estimation of covariance matrices is very important in many fields, such as statistics. In real applications, data are frequently affected by high dimensionality and noise, yet most relevant studies are based on complete data. This paper studies the optimal estimation of high-dimensional covariance matrices based on missing and noisy samples under the norm. First, the model with sub-Gaussian additive noise is presented. The generalized sample covariance is then modified to define a hard thresholding estimator, and the minimax upper bound is derived. After that, the minimax lower bound is derived, and it is concluded that the estimator presented in this article is rate-optimal. Finally, numerical simulation analysis is performed. The results show that, for missing samples with sub-Gaussian noise, if the true covariance matrix is sparse, the hard thresholding estimator outperforms the traditional estimation method.
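As a rough illustration of the pipeline sketched above (a generalized sample covariance followed by entrywise hard thresholding), the snippet below runs on synthetic data with entries missing completely at random. The pairwise observation-count correction, the noise level, and the threshold constant are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 50
Sigma = np.eye(p)
Sigma[0, 1] = Sigma[1, 0] = 0.5                         # sparse true covariance
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
X_noisy = X + 0.1 * rng.standard_normal((n, p))         # sub-Gaussian additive noise
mask = rng.random((n, p)) < 0.8                         # ~20% of entries missing
Y = np.where(mask, X_noisy, 0.0)

# Generalized sample covariance: rescale each entry by the number of sample
# pairs actually observed (an illustrative inverse-frequency correction).
pair_counts = mask.T.astype(float) @ mask.astype(float)
S = (Y.T @ Y) / np.maximum(pair_counts, 1.0)

# Hard thresholding: keep only entries whose magnitude exceeds a threshold of
# order sqrt(log p / n); the constant 1.0 is a tuning choice.
tau = 1.0 * np.sqrt(np.log(p) / n)
S_hat = np.where(np.abs(S) >= tau, S, 0.0)
np.fill_diagonal(S_hat, np.diag(S))                     # keep the diagonal as-is

print(np.linalg.norm(S_hat - Sigma, ord=2))             # spectral-norm error
```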
The performance of conventional similarity measurement methods is seriously degraded by the curse of dimensionality of high-dimensional data: the differences contributed by sparse and noisy dimensions occupy a large proportion of the similarity, so the dissimilarities between any two samples become hard to distinguish. A similarity measurement method for high-dimensional data based on a normalized net lattice subspace is proposed. The data range of each dimension is divided into several intervals, and the components in different dimensions are mapped onto the corresponding intervals. Only the components in the same or adjacent intervals are used to calculate the similarity. To validate this method, three data types are used and seven common similarity measurement methods are compared. The experimental results indicate that the relative difference of the proposed method increases with the dimensionality and is approximately two to three orders of magnitude larger than that of the conventional methods. In addition, the similarity range of this method in different dimensions is [0, 1], which makes it suitable for similarity analysis after dimensionality reduction.
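A minimal sketch of the interval idea described above, under some illustrative assumptions (equal-width intervals after min-max normalization, and a simple per-dimension contribution on the retained dimensions). Only dimensions whose interval indices are equal or adjacent contribute, so dimensions with large, noise-dominated discrepancies are excluded from the similarity.

```python
import numpy as np

def lattice_similarity(x, y, mins, maxs, n_bins=10):
    """Similarity in [0, 1] that ignores dimensions whose components fall
    into non-adjacent intervals of the normalized lattice (illustrative)."""
    # Normalize each dimension to [0, 1] and map it onto interval indices.
    xn = (x - mins) / (maxs - mins + 1e-12)
    yn = (y - mins) / (maxs - mins + 1e-12)
    bx = np.clip((xn * n_bins).astype(int), 0, n_bins - 1)
    by = np.clip((yn * n_bins).astype(int), 0, n_bins - 1)
    close = np.abs(bx - by) <= 1                 # same or adjacent interval
    if not np.any(close):
        return 0.0
    # Per-dimension similarity computed on the retained dimensions only.
    per_dim = 1.0 - np.abs(xn[close] - yn[close])
    return float(np.mean(per_dim))

rng = np.random.default_rng(0)
data = rng.random((100, 1000))                   # toy high-dimensional data
mins, maxs = data.min(axis=0), data.max(axis=0)
print(lattice_similarity(data[0], data[1], mins, maxs))
```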
The security of Federated Learning (FL) and Distributed Machine Learning (DML) is gravely threatened by data poisoning attacks, which destroy the usability of the model by contaminating training samples; such attacks are therefore called causative availability indiscriminate attacks. Because existing data sanitization methods are hard to apply to real-time applications due to their tedious processes and heavy computation, we propose a new supervised batch detection method for poisoned data, which can quickly sanitize the training dataset before local model training. We design a training dataset generation method that helps to enhance accuracy and uses data complexity features to train a detection model, which is then used in an efficient batch hierarchical detection process. Our model accumulates knowledge about poisoning, which can be expanded by retraining to adapt to new attacks. Being neither attack-specific nor scenario-specific, our method is applicable to FL/DML as well as other online or offline scenarios.
A data lake (DL) denotes a vast reservoir or repository of data. It accumulates substantial volumes of data and employs advanced analytics to correlate data from diverse origins containing various forms of semi-structured, structured, and unstructured information. These systems use a flat architecture and run different types of data analytics. NoSQL databases are non-tabular and store data differently from relational tables. NoSQL databases come in various forms, including key-value pairs, documents, wide columns, and graphs, each based on its own data model. They offer simpler scalability and generally outperform traditional relational databases. While NoSQL databases can store diverse data types, they lack full support for the atomicity, consistency, isolation, and durability features found in relational databases. Consequently, machine learning approaches become necessary to categorize complex structured query language (SQL) queries. Results indicate that the most frequently used automatic classification technique in processing SQL queries on NoSQL databases is machine learning-based classification. Overall, this study provides an overview of the automatic classification techniques used in processing SQL queries on NoSQL databases. Understanding these techniques can aid in the development of effective and efficient NoSQL database applications.
A nonlocality distillation protocol for arbitrary high-dimensional systems is proposed. We study nonlocality distillation in the 2-input d-output bipartite case. First, we give the one-parameter nonlocal boxes and their correlated distillation protocol. Then, we generalize the one-parameter nonlocality distillation protocol to the two-parameter case. Furthermore, we introduce a contracting protocol verifying that the 2-input d-output nonlocal boxes make communication complexity trivial.
Problems exist in the similarity measurement and index tree construction that affect the performance of nearest neighbor search for high-dimensional data. The equidistance problem is solved by using the NPsim function to calculate similarity, and a sequential NPsim matrix is built to improve indexing performance. Combining these innovations, a nearest neighbor search algorithm for high-dimensional data based on the sequential NPsim matrix is proposed and compared with nearest neighbor search algorithms based on the KD-tree and the SR-tree on the Munsell spectral dataset. Experimental results show that the similarity of the proposed algorithm is better than that of the other algorithms and that its search speed is thousands of times faster. In addition, the slow construction of the sequential NPsim matrix can be accelerated by parallel computing.
Viticulturists traditionally have a keen interest in studying the relationship between the biochemistry of grapevines' leaves/petioles and their associated spectral reflectance in order to understand fruit ripening rate, water status, nutrient levels, and disease risk. In this paper, we use imaging spectroscopy (hyperspectral) reflectance data covering the reflective 330 - 2510 nm wavelength region (986 spectral bands) to assess vineyard nutrient status; this constitutes a high-dimensional dataset with a covariance matrix that is ill-conditioned. The identification of the variables (wavelength bands) that contribute useful information for nutrient assessment and prediction plays a pivotal role in multivariate statistical modeling. In recent years, researchers have successfully developed many continuous, nearly unbiased, sparse and accurate variable selection methods to overcome this problem. This paper compares four regularized and one functional regression methods for wavelength variable selection: Elastic Net, Multi-Step Adaptive Elastic Net, Minimax Concave Penalty, iterative Sure Independence Screening, and Functional Data Analysis. Thereafter, the predictive performance of these regularized sparse models is enhanced using stepwise regression. This comparative study of regression methods on a high-dimensional and highly correlated grapevine hyperspectral dataset revealed that Elastic Net variable selection yields the best predictive ability.
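Since the grapevine hyperspectral dataset itself is not reproducible here, the sketch below only illustrates the Elastic Net selection step followed by a plain least-squares refit (a simple stand-in for the stepwise refinement mentioned above) on synthetic, highly correlated "band" data.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LinearRegression

rng = np.random.default_rng(0)
n, p = 120, 986                                   # few samples, many correlated bands
latent = rng.standard_normal((n, 10))
X = latent @ rng.standard_normal((10, p)) + 0.1 * rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[50, 300, 700]] = [2.0, -1.5, 1.0]
y = X @ beta + 0.5 * rng.standard_normal(n)       # synthetic "nutrient" response

# Step 1: Elastic Net with cross-validated penalties keeps a sparse set of bands.
enet = ElasticNetCV(l1_ratio=[0.3, 0.5, 0.7, 0.9], cv=5, max_iter=50000).fit(X, y)
selected = np.flatnonzero(enet.coef_)
print("bands kept by Elastic Net:", selected.size)

# Step 2: refit an ordinary least-squares model on the retained bands
# (a simplified stand-in for the stepwise refinement described in the abstract).
if selected.size:
    ols = LinearRegression().fit(X[:, selected], y)
    print("R^2 on training data:", ols.score(X[:, selected], y))
```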
Making accurate forecasts or predictions is a challenging task in the big data era, in particular for datasets involving high-dimensional variables but short-term time series points, which are generally available from real-world systems. To address this issue, Prof.
Latent factor (LF) models are highly effective in extracting useful knowledge from high-dimensional and sparse (HiDS) matrices, which are commonly seen in various industrial applications. An LF model usually adopts iterative optimizers, which may consume many iterations to reach a local optimum, resulting in considerable time cost. Hence, determining how to accelerate the training process of LF models has become a significant issue. To address this, this work proposes a randomized latent factor (RLF) model. It incorporates the principle of randomized learning techniques from neural networks into the LF analysis of HiDS matrices, thereby greatly alleviating the computational burden. It also extends a standard learning process for randomized neural networks to the context of LF analysis, so that the resulting model represents an HiDS matrix correctly. Experimental results on three HiDS matrices from industrial applications demonstrate that, compared with state-of-the-art LF models, RLF achieves significantly higher computational efficiency and comparable prediction accuracy for missing data. It provides an important alternative approach to LF analysis of HiDS matrices, which is especially desirable for industrial applications demanding highly efficient models.
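The abstract does not give RLF's training rule. By analogy with randomized neural networks, where input weights are fixed at random and only the output weights are solved in closed form, the sketch below fixes one latent factor matrix at random and solves the other by ridge least squares over the observed entries. This analogy is an assumption made for illustration, not the published RLF algorithm.

```python
import numpy as np

def randomized_lf(R, mask, rank=10, lam=0.1, seed=0):
    """R: m x n matrix, mask: boolean matrix of observed entries.
    Fix item factors Q at random, then solve user factors P row by row
    in closed form (ridge least squares on each user's observed entries)."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    Q = rng.standard_normal((n, rank)) / np.sqrt(rank)   # fixed random factors
    P = np.zeros((m, rank))
    for u in range(m):
        obs = mask[u]                                    # entries observed for row u
        if not obs.any():
            continue
        A = Q[obs].T @ Q[obs] + lam * np.eye(rank)
        b = Q[obs].T @ R[u, obs]
        P[u] = np.linalg.solve(A, b)                     # ridge solution
    return P, Q

rng = np.random.default_rng(1)
true = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 80))
mask = rng.random(true.shape) < 0.1                      # ~90% of entries missing
R = np.where(mask, true, 0.0)
P, Q = randomized_lf(R, mask, rank=5)
rmse = np.sqrt(np.mean((P @ Q.T - true)[~mask] ** 2))    # error on unobserved entries
print(rmse)
```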
On November 13, 2016, an MW 7.8 earthquake struck Kaikoura in the South Island of New Zealand. By means of back-projection of array recordings, ASTF analysis of global seismic recordings, and joint inversion of global seismic data and co-seismic InSAR data, we investigated the complexity of the earthquake source. The results show that the 2016 MW 7.8 Kaikoura earthquake ruptured for about 100 s unilaterally from south to northeast (~N28°–33°E), producing a rupture area about 160 km long and about 50 km wide and releasing a scalar moment of 1.01×10^21 Nm. In particular, the rupture area consisted of two slip asperities, one close to the initial rupture point with a maximal slip of ~6.9 m, and the other far away to the northeast with a maximal slip of ~9.3 m. The first asperity slipped for about 65 s, and the second started 40 s after the first had initiated; the two slipped simultaneously for about 25 s. Furthermore, the first had nearly pure thrust slip while the second had both thrust and strike slip. Interestingly, the rupture velocity was not constant, and the whole process may be divided into 5 stages in which the velocities were estimated to be 1.4 km/s, 0 km/s, 2.1 km/s, 0 km/s and 1.1 km/s, respectively. The high-frequency sources were distributed nearly along the lower edge of the rupture area, the high-frequency radiation mainly occurred at the initiation of the asperities, and it seems that no high-frequency energy was radiated when the rupture was about to stop.
To increase the storage capacity in holographic data storage (HDS), the information to be stored is encoded into a complex amplitude. Fast and accurate retrieval of amplitude and phase from the reconstructed beam is necessary during data readout in HDS. In this study, we propose a complex amplitude demodulation method based on deep learning from a single-shot diffraction intensity image and verify it with a non-interferometric lensless experiment demodulating four-level amplitude and four-level phase. By analyzing the correlation between the diffraction intensity features and the amplitude and phase encoding data pages, the inverse problem is decomposed into two backward operators, represented by two convolutional neural networks (CNNs), to demodulate amplitude and phase respectively. The experimental system is simple, stable, and robust, and it needs only a single diffraction image to realize the direct demodulation of both amplitude and phase. To the best of our knowledge, this is the first time in HDS that multilevel complex amplitude demodulation has been achieved experimentally from one diffraction intensity image without iterations.
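A minimal sketch of the "two backward operators" idea, assuming a small fully convolutional architecture (the actual network design, training data, and loss are not given in the abstract): one CNN maps the single-shot intensity image to per-pixel amplitude-level logits, and a second, structurally identical one does the same for phase levels.

```python
import torch
import torch.nn as nn

class Demodulator(nn.Module):
    """Tiny illustrative CNN mapping one diffraction-intensity image to a
    per-pixel class map over n_levels (four-level amplitude or phase)."""
    def __init__(self, n_levels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, n_levels, kernel_size=1),      # per-pixel level logits
        )

    def forward(self, x):
        return self.net(x)

amp_net, phase_net = Demodulator(), Demodulator()        # two backward operators
intensity = torch.rand(1, 1, 64, 64)                     # one single-shot intensity image
amp_logits, phase_logits = amp_net(intensity), phase_net(intensity)
print(amp_logits.shape, phase_logits.shape)              # (1, 4, 64, 64) each
```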
To address the issue that traditional clustering methods are not suitable for high-dimensional data, a cuckoo search fuzzy-weighting algorithm for subspace clustering is presented on the basis of existing soft subspace clustering algorithms. In the proposed algorithm, a novel objective function is first designed by considering the fuzzily weighted within-cluster compactness and the between-cluster separation, and by loosening the constraints on the dimension weight matrix. Then gradual membership and an improved cuckoo search, a global search strategy, are introduced to optimize the objective function and search for subspace clusters, yielding novel learning rules for clustering. Finally, the performance of the proposed algorithm in clustering analysis of various low- and high-dimensional datasets is experimentally compared with that of several competitive subspace clustering algorithms. Experimental studies demonstrate that the proposed algorithm obtains better performance than most of the existing soft subspace clustering algorithms.
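The exact objective function is not given in the abstract; the snippet below is only a guess at its general shape, combining a fuzzily weighted within-cluster compactness term with a between-cluster separation term controlled by a hypothetical trade-off parameter `eta`.

```python
import numpy as np

def soft_subspace_objective(X, U, W, centers, m=2.0, eta=1.0):
    """Illustrative soft-subspace objective: fuzzy-weighted within-cluster
    compactness minus eta times a between-cluster separation term.
    X: (n, d) data, U: (k, n) fuzzy memberships, W: (k, d) dimension weights,
    centers: (k, d) cluster centers."""
    k = centers.shape[0]
    global_center = X.mean(axis=0)
    compact, separate = 0.0, 0.0
    for c in range(k):
        diff = X - centers[c]                                   # (n, d)
        compact += np.sum((U[c] ** m)[:, None] * W[c] * diff ** 2)
        separate += np.sum(U[c] ** m) * np.sum(W[c] * (centers[c] - global_center) ** 2)
    return compact - eta * separate

rng = np.random.default_rng(0)
X = rng.random((60, 8))
k = 3
U = rng.dirichlet(np.ones(k), size=60).T          # memberships sum to 1 per sample
W = np.full((k, 8), 1.0 / 8)                      # uniform dimension weights to start
centers = X[rng.choice(60, k, replace=False)]
print(soft_subspace_objective(X, U, W, centers))
```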
In this paper, an Observation Points Classifier Ensemble (OPCE) algorithm is proposed to deal with High-Dimensional Imbalanced Classification (HDIC) problems based on data processed using the Multi-Dimensional Scaling (MDS) feature extraction technique. First, the dimensionality of the original imbalanced data is reduced using MDS so that distances between any two different samples are preserved as well as possible. Second, a novel OPCE algorithm is applied to classify imbalanced samples by placing optimised observation points in a low-dimensional data space. Third, optimization of the observation point mappings is carried out to obtain a reliable assessment of the unknown samples. Exhaustive experiments have been conducted to evaluate the feasibility, rationality, and effectiveness of the proposed OPCE algorithm using seven benchmark HDIC data sets. Experimental results show that (1) the OPCE algorithm can be trained faster on low-dimensional imbalanced data than on high-dimensional data; (2) the OPCE algorithm can correctly identify samples as the number of optimised observation points is increased; and (3) statistical analysis reveals that OPCE yields better HDIC performances on the selected data sets in comparison with eight other HDIC algorithms. This demonstrates that OPCE is a viable algorithm to deal with HDIC problems.
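A minimal sketch of the first step described above (MDS-based dimensionality reduction of imbalanced data), using scikit-learn on toy data; the OPCE ensemble itself is not reproduced here, and the data are synthetic rather than the benchmark sets used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import MDS

# Imbalanced, moderately high-dimensional toy data (illustrative only).
X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)

# Metric MDS preserves pairwise distances as well as possible while
# embedding the samples into a low-dimensional space.
embedding = MDS(n_components=5, dissimilarity="euclidean", random_state=0)
X_low = embedding.fit_transform(X)
print(X_low.shape)   # (300, 5); downstream OPCE classification would use X_low
```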
Baddeleyite is an important mineral geochronometer. It is valued more highly than zircon in U-Pb (ID-TIMS) geochronology because of its magmatic origin, whereas zircon can be metamorphic or hydrothermal, or occur as xenocrysts. Detailed mineralogical (BSE, KL, etc.) research on baddeleyite started in the Fennoscandian Shield in the 1990s. The mineral was first extracted from the Paleozoic Kovdor deposit, the second-biggest baddeleyite deposit in the world after Phalaborwa (2.1 Ga), South Africa, and was successfully introduced into the U-Pb systematics. This study provides new U-Pb and LA-ICP-MS data on Archean Ti-Mgt and BIF deposits, Paleoproterozoic layered PGE intrusions with Pt-Pd and Cu-Ni reefs, and Paleozoic complex deposits (baddeleyite, apatite, foscorite ores, etc.) in the NE Fennoscandian Shield. Data on REE concentrations in baddeleyite and on the closure temperature of its U-Pb systematics are also provided. It is shown that baddeleyite plays an important role in the geological history of the Earth, in particular in the break-up of supercontinents.
As a crucial data preprocessing method in data mining, feature selection (FS) can be regarded as a bi-objective optimization problem that aims to maximize classification accuracy and minimize the number of selected features. Evolutionary computing (EC) is promising for FS owing to its powerful search capability. However, in traditional EC-based methods, feature subsets are represented via a length-fixed individual encoding. This is ineffective for high-dimensional data, because it results in a huge search space and prohibitive training time. This work proposes a length-adaptive non-dominated sorting genetic algorithm (LA-NSGA) with a length-variable individual encoding and a length-adaptive evolution mechanism for bi-objective high-dimensional FS. In LA-NSGA, an initialization method based on correlation and redundancy is devised to initialize individuals of diverse lengths, and a Pareto dominance-based length change operator is introduced to guide individuals to explore promising search space adaptively. Moreover, a dominance-based local search method is employed for further improvement. Experimental results on 12 high-dimensional gene datasets show that the Pareto front of feature subsets produced by LA-NSGA is superior to those of existing algorithms.
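As a small illustration of the bi-objective formulation (minimize classification error, minimize the number of selected features) and of Pareto dominance between candidate subsets, the sketch below evaluates random feature masks on toy data; the length-adaptive encoding and evolution operators of LA-NSGA are not reproduced.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=150, n_features=200, n_informative=15,
                           random_state=0)

def evaluate(mask, X, y):
    """Two objectives for one candidate feature subset:
    (1 - CV accuracy, number of selected features); both are minimized."""
    if mask.sum() == 0:
        return 1.0, 0
    acc = cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()
    return 1.0 - acc, int(mask.sum())

def dominates(a, b):
    """Pareto dominance for minimization of both objectives."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

rng = np.random.default_rng(0)
# Length-variable encodings are mimicked here by random masks of varying sparsity.
population = [rng.random(X.shape[1]) < rng.uniform(0.05, 0.5) for _ in range(10)]
scores = [evaluate(m, X, y) for m in population]
front = [s for s in scores if not any(dominates(o, s) for o in scores if o is not s)]
print(front)   # non-dominated (error, subset-size) pairs
```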
The study investigated user experience, display complexity, display type (tables versus graphs), and task difficulty as variables affecting the user's ability to navigate through complex visual data. A total of 64 participants, 39 undergraduate students (novice users) and 25 graduate students (intermediate-level users), took part in the study. The experiment used a 2 × 2 × 2 × 3 mixed design with two between-subject variables (display complexity, user experience) and two within-subject variables (display format, question difficulty). The results indicated that response time was shorter for graphs (relative to tables), especially when the questions were difficult. The intermediate users appeared to adopt more extensive search strategies than novices, as revealed by an analysis of the number of changes they made to the display before answering questions. It was concluded that designers of data displays should consider the (a) type of display, (b) difficulty of the task, and (c) expertise level of the user to obtain optimal levels of performance.
Complex survey designs often involve unequal selection probabilities of clusters or units within clusters. When estimating models for complex survey data, scaled weights are incorporated into the likelihood, producing a pseudo likelihood. In a 3-level weighted analysis for a binary outcome, we implemented two methods for scaling the sampling weights in the National Health Survey of Pakistan (NHSP). For the NHSP with health care utilization as a binary outcome, we found age, gender, household (HH) goods, urban/rural status, community development index, province, and marital status to be significant predictors of health care utilization (p-value < 0.05). The variance of the random intercepts using scaling method 1 is estimated as 0.0961 (standard error 0.0339) at the PSU level and 0.2726 (standard error 0.0995) at the household level. Both estimates are significantly different from zero (p-value < 0.05) and indicate considerable heterogeneity in health care utilization with respect to households and PSUs. The NHSP data analysis showed that all three analyses, weighted (two scaling methods) and unweighted, converged to almost identical results with few exceptions. This may have occurred because of the large number of third- and second-level clusters and the relatively small ICC. We performed a simulation study to assess the effect of varying prevalence and intra-class correlation coefficients (ICCs) on the bias of fixed effect parameters and variance components in a multilevel pseudo maximum likelihood (weighted) analysis. The simulation results showed that the performance of the scaled weighted estimators is satisfactory for both scaling methods. Incorporating simulation into the analysis of complex multilevel surveys allows the integrity of the results to be tested and is recommended as good practice.
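One widely used way to scale level-1 sampling weights rescales them so that they sum to the cluster sample size; whether this coincides with the paper's "scaling method 1" is an assumption, and the sketch below only illustrates the mechanics on toy household-within-PSU data.

```python
import pandas as pd

# Toy household-within-PSU data with raw design weights (illustrative only).
df = pd.DataFrame({
    "psu":    [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "weight": [2.0, 3.0, 5.0, 1.5, 2.5, 4.0, 4.0, 1.0, 1.0],
})

# Cluster-size scaling: scaled weights sum to the number of sampled units in
# each PSU. (An alternative common choice scales them to the effective
# cluster size, sum(w)^2 / sum(w^2).)
def scale_to_cluster_size(g):
    return g * len(g) / g.sum()

df["w_scaled"] = df.groupby("psu")["weight"].transform(scale_to_cluster_size)
print(df.groupby("psu")["w_scaled"].sum())   # equals each PSU's sample size
```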
In studies of HIV, interval-censored data occur naturally. HIV infection time is usually not known exactly, only that it occurred before the survey, within some time interval, or had not occurred at the time of the survey. Infections are often clustered within geographical areas such as enumerator areas (EAs), thereby inducing unobserved frailty. In this paper we consider an approach for estimating parameters when infection time is unknown and assumed correlated within an EA, where the dependency is modeled as frailties, assuming a normal distribution for the frailties and a Weibull distribution for the baseline hazards. The data came from a household-based population survey that used a multi-stage stratified sample design to randomly select 23,275 interviewed individuals from 10,584 households, of whom 15,851 were further tested for HIV (crude prevalence = 9.1%). A further test conducted among those who tested HIV positive found 181 (12.5%) recently infected. Results show a high degree of heterogeneity in HIV distribution between EAs, translating to a modest correlation of 0.198. Intervention strategies should target geographical areas that contribute disproportionately to the HIV epidemic. Further research needs to identify such hot spot areas and understand what factors make these areas prone to HIV.
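A standard way to write such a model, consistent with (though not necessarily identical to) the specification described above, places a normally distributed EA-level frailty on the log-hazard scale of a Weibull baseline:

$$
h_{ij}(t \mid b_i) = \lambda \rho t^{\rho-1} \exp\!\left(\mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + b_i\right), \qquad b_i \sim N(0, \sigma^2),
$$

where $h_{ij}$ is the hazard for individual $j$ in EA $i$, and $\lambda$ and $\rho$ are the Weibull scale and shape parameters. The corresponding survival function is $S_{ij}(t \mid b_i) = \exp\{-\lambda t^{\rho} e^{\mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + b_i}\}$, so an infection known only to lie in the interval $(L_{ij}, R_{ij}]$ contributes $S_{ij}(L_{ij} \mid b_i) - S_{ij}(R_{ij} \mid b_i)$ to the likelihood, with the frailty $b_i$ integrated out numerically.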