The Qilian Mountains, a national key ecological function zone in Western China, play a pivotal role in ecosystem services. However, the distribution of its dominant tree species, Picea crassifolia (Qinghai spruce), has decreased dramatically in the past decades due to climate change and human activity, which may have impaired its ecological functions. Reasonable reforestation is the key measure for restoring these functions. Many previous efforts have predicted the potential distribution of Picea crassifolia, providing guidance for regional reforestation policy. However, all of them were performed at low spatial resolution, thus ignoring the naturally patchy distribution of Picea crassifolia. Here, we modeled the distribution of Picea crassifolia with species distribution models at high spatial resolutions. For many models, the area under the receiver operating characteristic curve (AUC) exceeds 0.9, indicating excellent precision. The AUC of models at 30 m is higher than that of models at 90 m, and the current potential distribution of Picea crassifolia is more closely aligned with its actual distribution at 30 m, demonstrating that finer data resolution improves model performance. Moreover, for models at 90 m resolution, annual precipitation (Bio12) exerted the strongest influence on the distribution of Picea crassifolia, while aspect became the most important factor at 30 m, indicating the crucial role of finer topographic data in modeling species with patchy distributions.
The current distribution of Picea crassifolia was concentrated in the northern and central parts of the study area, and this pattern will be maintained under future scenarios, although some habitat loss in the central parts and gains in the eastern regions are expected owing to increasing temperatures and precipitation. Our findings can guide protection and restoration strategies for the Qilian Mountains, which would benefit the regional ecological balance.
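The AUC reported above can be read as the probability that the model scores a randomly chosen presence cell higher than a randomly chosen absence cell. A minimal sketch of that rank-based computation; all scores below are hypothetical, not values from the study:

```python
# Rank-based AUC: the probability that a random presence cell
# scores higher than a random absence cell (ties count 1/2).
def roc_auc(scores_pos, scores_neg):
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

presence = [0.9, 0.8, 0.75, 0.6]   # hypothetical suitability at occurrence cells
absence = [0.4, 0.85, 0.55, 0.1]   # hypothetical suitability at background cells
print(roc_auc(presence, absence))  # -> 0.8125; 1.0 would be perfect discrimination
```

An AUC above 0.9, as reported for many of the models, means presence cells outrank absence cells in more than 90% of such pairs.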
Sharing data while protecting privacy in the industrial Internet is a significant challenge. Traditional machine learning methods require a combination of all data for training; however, this approach can be limited by data availability and privacy concerns. Federated learning (FL) has gained considerable attention because it allows for decentralized training on multiple local datasets. However, the training data collected by data providers are often non-independent and identically distributed (non-IID), resulting in poor FL performance. This paper proposes a privacy-preserving approach for sharing non-IID data in the industrial Internet using an FL approach based on blockchain technology. To overcome the problem of non-IID data leading to poor training accuracy, we propose dynamically updating the local model based on the divergence of the global and local models. This approach can significantly improve the accuracy of FL training when there is relatively large dispersion. In addition, we design a dynamic gradient clipping algorithm to alleviate the influence of noise on the model accuracy and to reduce the potential privacy leakage caused by sharing model parameters. Finally, we evaluate the performance of the proposed scheme using commonly used open-source image datasets. The simulation results demonstrate that our method can significantly enhance accuracy while protecting privacy and maintaining efficiency, thereby providing a new solution to the data-sharing and privacy-protection challenges in the industrial Internet.
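The two mechanisms described above — a divergence-triggered local-model correction and a per-round gradient clipping bound — can be sketched generically. The abstract does not give the exact update rule, so the threshold, blending weight, and decay factor below are illustrative assumptions, not the paper's algorithm:

```python
import math

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def divergence(local, global_):
    # L2 distance between local and global model parameters
    return l2([a - b for a, b in zip(local, global_)])

def blend_local(local, global_, threshold=1.0, alpha=0.5):
    # If the local model has drifted too far from the global model,
    # pull it back toward the global weights before the next round.
    if divergence(local, global_) <= threshold:
        return list(local)
    return [alpha * a + (1 - alpha) * b for a, b in zip(local, global_)]

def clip_gradient(grad, round_idx, c0=1.0, decay=0.9):
    # Dynamic clipping: the norm bound shrinks each round, so a fixed
    # noise scale perturbs later (smaller) gradients relatively less.
    bound = c0 * decay ** round_idx
    norm = l2(grad)
    if norm <= bound:
        return list(grad)
    return [g * bound / norm for g in grad]
```

Clipping caps each shared gradient's norm before noise is added, which bounds any single update's contribution and hence the privacy leakage from shared parameters.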
The security of Federated Learning (FL)/Distributed Machine Learning (DML) is gravely threatened by data poisoning attacks, which destroy the usability of the model by contaminating training samples; such attacks are therefore called causative availability indiscriminate attacks. Because existing data sanitization methods are hard to apply to real-time applications due to their tedious processes and heavy computations, we propose a new supervised batch detection method for poison, which can quickly sanitize the training dataset before local model training. We design a training dataset generation method that helps to enhance accuracy and uses data complexity features to train a detection model, which is then used in an efficient batch hierarchical detection process. Our model stockpiles knowledge about poison, which can be expanded by retraining to adapt to new attacks. Being neither attack-specific nor scenario-specific, our method is applicable to FL/DML and to other online or offline scenarios.
The scale and complexity of big data are growing continuously, posing severe challenges to traditional data processing methods, especially in the field of clustering analysis. To address this issue, this paper introduces a new method named Big Data Tensor Multi-Cluster Distributed Incremental Update (BDTMCDIncreUpdate), which combines distributed computing, storage technology, and incremental update techniques to provide an efficient and effective means for clustering analysis. Firstly, the original dataset is divided into multiple sub-blocks, and distributed computing resources are utilized to process the sub-blocks in parallel, enhancing efficiency. Then, initial clustering is performed on each sub-block using tensor-based multi-clustering techniques to obtain preliminary results. When new data arrive, incremental update technology is employed to update the core tensor and factor matrix, ensuring that the clustering model can adapt to changes in the data. Finally, by combining the updated core tensor and factor matrix with historical computational results, refined clustering results are obtained, achieving real-time adaptation to dynamic data. In experiments on the Aminer dataset, the BDTMCDIncreUpdate method demonstrated outstanding performance in terms of accuracy (ACC) and normalized mutual information (NMI), achieving an accuracy of 90% and an NMI score of 0.85, outperforming existing methods such as TClusInitUpdate and TKLClusUpdate in most scenarios. The BDTMCDIncreUpdate method therefore offers an innovative solution for big data analysis, integrating distributed computing, incremental updates, and tensor-based multi-clustering techniques. It not only improves efficiency and scalability in processing large-scale high-dimensional datasets but has also been validated for effectiveness and accuracy through experiments. The method shows great potential in real-world applications where dynamic data growth is common and is of significant importance for advancing the development of data analysis technology.
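The incremental-update idea — folding each new data point into existing cluster statistics in O(1) rather than re-clustering from scratch — can be illustrated with running centroids. This is a simplified stand-in for the paper's core-tensor/factor-matrix updates, not the BDTMCDIncreUpdate algorithm itself:

```python
class IncrementalCluster:
    """Per-cluster running means updated in O(1) per new point."""
    def __init__(self, dim, k):
        self.means = [[0.0] * dim for _ in range(k)]
        self.counts = [0] * k

    def assign(self, x):
        # nearest centroid by squared Euclidean distance
        def d2(m):
            return sum((a - b) ** 2 for a, b in zip(x, m))
        return min(range(len(self.means)), key=lambda j: d2(self.means[j]))

    def update(self, x):
        # fold the new point into the winning cluster's mean:
        # mean_new = mean_old + (x - mean_old) / n
        j = self.assign(x)
        self.counts[j] += 1
        n = self.counts[j]
        self.means[j] = [m + (a - m) / n for m, a in zip(self.means[j], x)]
        return j
```

The same principle — keeping sufficient statistics that absorb new observations cheaply — is what lets the tensor-based model adapt to streaming data in real time.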
Due to the restricted satellite payloads in LEO mega-constellation networks (LMCNs), remote sensing image analysis, online learning, and other big data services desirably need onboard distributed processing (OBDP). In existing technologies, the efficiency of big data applications (BDAs) in distributed systems hinges on stable, low-latency links between worker nodes. However, LMCNs, with their high-dynamic nodes and long-distance links, cannot provide these conditions, which makes the performance of OBDP hard to measure intuitively. To bridge this gap, a multidimensional simulation platform is indispensable: one that can simulate the network environment of LMCNs and run BDAs in it for performance testing. Using STK's APIs and a parallel computing framework, we achieve real-time simulation of thousands of satellite nodes, which are mapped to application nodes through software-defined networking (SDN) and container technologies. We elaborate the architecture and mechanism of the simulation platform and take Starlink and Hadoop as realistic examples for simulations. The results indicate that LMCNs have dynamic end-to-end latency that fluctuates periodically with the constellation movement. Compared to ground data center networks (GDCNs), LMCNs degrade computing and storage job throughput, which can be alleviated by the use of erasure codes and data-flow scheduling of worker nodes.
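The periodic end-to-end latency fluctuation can be illustrated with a toy model: propagation delay is path length over the speed of light, and the inter-satellite path length oscillates as the constellation moves. The base length, amplitude, and period below are illustrative placeholders, not Starlink parameters:

```python
import math

C_KM_S = 299_792.458  # speed of light in km/s

def propagation_delay_ms(distance_km):
    # one-way propagation delay of a link of the given length
    return distance_km / C_KM_S * 1000.0

def path_length_km(t_s, base=1000.0, amplitude=400.0, period_s=5400.0):
    # toy model: the multi-hop path length oscillates roughly once
    # per orbital period as the constellation moves
    return base + amplitude * math.sin(2 * math.pi * t_s / period_s)

# latency trace sampled over one period
trace = [propagation_delay_ms(path_length_km(t)) for t in range(0, 5400, 600)]
```

Even this toy trace is periodic rather than constant, which is the qualitative behavior the simulation platform measures for real constellations.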
Multimodal sentiment analysis utilizes multimodal data such as text, facial expressions, and voice to detect people's attitudes. With the advent of distributed data collection and annotation, we can easily obtain and share such multimodal data. However, due to professional discrepancies among annotators and lax quality control, noisy labels might be introduced. Recent research suggests that deep neural networks (DNNs) will overfit noisy labels, leading to poor DNN performance. To address this challenging problem, we present a Multimodal Robust Meta Learning framework (MRML) for multimodal sentiment analysis that resists noisy labels and correlates distinct modalities simultaneously. Specifically, we propose a two-layer fusion net to deeply fuse different modalities and improve the quality of the multimodal data features for label correction and network training. Besides, a multiple meta-learner (label corrector) strategy is proposed to enhance the label correction approach and prevent models from overfitting to noisy labels. We conducted experiments on three popular multimodal datasets to verify the superiority of our method by comparing it with four baselines.
Traditional distribution network planning relies on the professional knowledge of planners, especially when analyzing the correlations between the problems existing in the network and the crucial influencing factors. The inherent laws reflected by the historical data of the distribution network are ignored, which affects the objectivity of the planning scheme. In this study, to improve the efficiency and accuracy of distribution network planning, the characteristics of distribution network data were extracted using a data-mining technique, and correlation knowledge of existing problems in the network was obtained. A data-mining model based on correlation rules was established. The inputs of the model were the electrical characteristic indices screened using the gray correlation method. The Apriori algorithm was used to extract correlation knowledge from the operational data of the distribution network and obtain strong correlation rules. Lift (degree of promotion) and chi-square tests were used to verify the rationality of the strong correlation rules output by the model. In this study, the correlation between heavy-load or overload problems of distribution network feeders in different regions and the related characteristic indices was determined, and the confidence of the correlation rules was obtained. These results can provide an effective basis for the formulation of a distribution network planning scheme.
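The rule-quality measures used above (support, confidence, lift) can be computed directly from transaction data. A minimal sketch with hypothetical feeder records — the attribute names are invented for illustration, not indices from the study:

```python
def support(transactions, itemset):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

def lift(transactions, lhs, rhs):
    # lift > 1: lhs and rhs co-occur more often than independence predicts
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)

# hypothetical feeder observations, one attribute set per record
transactions = [
    {"overload", "long_feeder"},
    {"overload", "long_feeder"},
    {"normal_load", "short_feeder"},
    {"overload", "short_feeder"},
]
```

A rule such as long_feeder → overload is "strong" when both its confidence and its lift clear chosen thresholds, which is exactly what the lift and chi-square checks verify.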
The fitting of lifetime distributions to real-life data has been studied in various fields of research. As real-world scenarios continue to evolve, more complex data will continue to emerge. In response, many researchers have made commendable efforts to develop new lifetime distributions that can fit such complex data. In this paper, we utilize the KM-transformation technique to increase the flexibility of the power Lindley distribution, resulting in the Kavya-Manoharan Power Lindley (KMPL) distribution. We study the mathematical treatment of the KMPL distribution in detail and adapt the widely used method of maximum likelihood to estimate its unknown parameters. We carry out a Monte Carlo simulation study to investigate the performance of the maximum likelihood estimates (MLEs) of the parameters of the KMPL distribution. To demonstrate the effectiveness of the KMPL distribution for data fitting, we use a real dataset comprising the waiting times of 100 bank customers. We compare the KMPL distribution with other models that are extensions of the power Lindley distribution. Based on several statistical model selection criteria, the summary results of the analysis favor the KMPL distribution. We further investigate the density fit and probability-probability (p-p) plots to validate the superiority of the KMPL distribution over the competing distributions for fitting the waiting-time dataset.
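A hedged sketch of the construction: the KM transform maps a baseline CDF F into G(x) = e/(e-1) * (1 - exp(-F(x))), here applied to the common two-parameter power Lindley form F(x) = 1 - (1 + theta*x^alpha/(theta+1)) * exp(-theta*x^alpha). Both formulas are standard forms assumed here, not quoted from the paper:

```python
import math

E_FACTOR = math.e / (math.e - 1.0)

def power_lindley_cdf(x, alpha, theta):
    # baseline: F(x) = 1 - (1 + theta*x**alpha/(theta+1)) * exp(-theta*x**alpha)
    if x <= 0.0:
        return 0.0
    y = theta * x ** alpha
    return 1.0 - (1.0 + y / (theta + 1.0)) * math.exp(-y)

def kmpl_cdf(x, alpha, theta):
    # Kavya-Manoharan transform: G(x) = e/(e-1) * (1 - exp(-F(x)))
    return E_FACTOR * (1.0 - math.exp(-power_lindley_cdf(x, alpha, theta)))
```

Because the transform is monotone and maps [0, 1] onto [0, 1], G is a valid CDF with the same support as the baseline, gaining flexibility without adding a parameter.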
Operation control of power systems has become challenging with the increase in the scale and complexity of power distribution systems and extensive access to renewable energy. Therefore, improvement of the ability of data-driven operation management, intelligent analysis, and mining is urgently required. To investigate and explore similar regularities of the historical operating sections of the power distribution system, and to help the power grid systematically accumulate high-value historical operation and maintenance experience and knowledge, a neural information retrieval model with an attention mechanism is proposed based on graph data computing technology. Based on the processing flow of the operating data of the power distribution system, a technical framework for neural information retrieval is established. Combined with the natural graph characteristics of the power distribution system, a unified graph data structure and a data fusion method covering data access, data complement, and multi-source data are constructed. Further, a graph node feature-embedding representation learning algorithm and a neural information retrieval algorithm model are constructed. The neural information retrieval model is trained and tested using the generated set of graph node feature representation vectors. The model is verified on the operating sections of the power distribution system of a provincial grid area. The results show that the proposed method achieves high accuracy in the similarity matching of historical operation characteristics and effectively supports intelligent fault diagnosis and elimination in power distribution systems.
Distribution networks constitute important public infrastructure necessary for people's livelihoods. However, extreme natural disasters, such as earthquakes, typhoons, and mudslides, severely threaten the safe and stable operation of distribution networks and the power supplies needed for daily life. Therefore, considering the requirements for distribution network disaster prevention and mitigation, there is an urgent need for in-depth research on risk assessment methods for distribution networks under extreme natural disaster conditions. This paper accesses multi-source data, presents data quality improvement methods for distribution networks, and conducts data-driven active fault diagnosis and disaster damage analysis and evaluation using data-driven theory. Furthermore, the paper realizes real-time, accurate access to distribution network disaster information. A case study shows that the proposed approach performs an accurate and rapid assessment of cross-sectional risk and that the minimal average annual outage time can be reduced to 3 h/a in the ring network. The approach proposed in this paper can provide technical support for further improving the ability of distribution networks to cope with extreme natural disasters.
Big Data (BD), which is nothing but a collection of a huge amount of data, has been utilized extensively in several fields such as finance, industry, business, and medicine. However, processing a massive amount of data is highly complicated and time-consuming. Thus, to design a distribution-preserving framework for BD, a novel methodology has been proposed utilizing Manhattan Distance-centered Partition Around Medoid (MD-PAM) along with a Conjugate Gradient Artificial Neural Network (CG-ANN), which undergoes various steps to reduce the complications of BD. Firstly, the data are processed in the pre-processing phase by mitigating data repetition utilizing the map-reduce function; subsequently, missing data are handled by substituting or ignoring the missing values. After that, the data are transformed into a normalized form. Next, to enhance classification performance, the data's dimensionality is reduced by employing Gaussian Kernel Fisher Discriminant Analysis (GK-FDA). Afterwards, the processed data are transformed into a structured format and submitted to the partitioning phase. In the partition phase, the data are partitioned and grouped into clusters utilizing MD-PAM. Lastly, the data are classified in the classification phase by employing CG-ANN, so that the needed data can be effortlessly retrieved by the user. To compare the outcomes of the CG-ANN with the prevailing methodologies, the openly accessible NSL-KDD datasets are utilized. The experimental results showed that the proposed CG-ANN yields efficient results with a reduced computation cost. The proposed work outperforms existing systems in terms of accuracy, sensitivity, and specificity.
To improve data distribution efficiency, a load-balancing data distribution (LBDD) method is proposed for the publish/subscribe mode. In the LBDD method, subscribers are involved in distribution tasks and data transfers while receiving data themselves. A dissemination tree is constructed among the subscribers based on MD5, where the publisher acts as the root. The proposed method provides bucket construction, target selection, and path updates; furthermore, the property of one-way dissemination is proven. The proposed LBDD guarantees that the average out-degree of a node is 2. Experiments on data distribution delay, data distribution rate, and load distribution are conducted. Experimental results show that the LBDD method helps to share the task load between the publisher and subscribers and outperforms the point-to-point approach.
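The MD5-based dissemination tree with out-degree 2 can be sketched by ordering subscribers by their MD5 digests and arranging them as a binary tree under the publisher. This is a simplified sketch; the actual LBDD bucket-construction and path-update rules are not reproduced here:

```python
import hashlib

def md5_key(node_id):
    return hashlib.md5(node_id.encode()).hexdigest()

def build_tree(publisher, subscribers):
    # Order subscribers by MD5 digest, then lay them out as a complete
    # binary tree under the publisher, so each node forwards to at most 2.
    nodes = [publisher] + sorted(subscribers, key=md5_key)
    children = {n: [] for n in nodes}
    for i in range(1, len(nodes)):
        children[nodes[(i - 1) // 2]].append(nodes[i])
    return children
```

Because every internal node forwards to at most two children, the publisher's upload load stays constant as the subscriber set grows, which is the load-sharing property the experiments measure.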
The existence of three well-defined tongue-shaped zones of swell dominance, termed 'swell pools', in the Pacific, the Atlantic, and the Indian Oceans was reported by Chen et al. (2002) using satellite data. In this paper, the ECMWF re-analysis wind-wave data, including wind speed, significant wave height, averaged wave period, and direction, are applied to verify the existence of these swell pools. Swell indices calculated from wave height, wave age, and the correlation coefficient are used to identify swell events. The wave-age swell index can be more appropriately related to physical processes than the other two swell indices. Based on the ECMWF data, the swell pools in the Pacific and the Atlantic Oceans are confirmed, but the expected swell pool in the Indian Ocean is not pronounced. The seasonal variations of global and hemispherical swell indices are investigated, and the argument that swells in the pools originate mostly from the winter hemisphere is supported by the seasonal variation of the averaged wave direction. The northward bending of the swell pools in the Pacific and the Atlantic Oceans in summer is not revealed by the ECMWF data. The swell pool in the Indian Ocean and the summer northward bending of the swell pools in the Pacific and the Atlantic Oceans need to be further verified with other datasets.
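A wave-age swell criterion can be sketched as follows: deep-water phase speed is c_p = g*T/(2*pi), wave age is c_p divided by the wind speed, and waves outrunning the local wind are classed as swell. The 1.2 threshold is a commonly used value assumed here for illustration, not necessarily the index definition used in the paper:

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def phase_speed(period_s):
    # deep-water phase speed: c_p = g * T / (2 * pi)
    return G * period_s / (2.0 * math.pi)

def wave_age(period_s, wind_speed_ms):
    return phase_speed(period_s) / wind_speed_ms

def is_swell(period_s, wind_speed_ms, threshold=1.2):
    # waves travelling faster than the local wind are swell-dominated
    return wave_age(period_s, wind_speed_ms) > threshold
```

This is why the wave-age index ties directly to physics: it compares wave propagation speed against the wind that could still be forcing the waves.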
Net Primary Productivity (NPP) is one of the important biophysical variables of vegetation activity, and it plays an important role in studying the global carbon cycle, the carbon sources and sinks of ecosystems, and the spatial and temporal distribution of CO2. Remote sensing provides a broad view quickly, timely, and multi-temporally, which makes it an attractive and powerful tool for studying ecosystem primary productivity at scales ranging from local to global. This paper uses Moderate Resolution Imaging Spectroradiometer (MODIS) data to estimate and analyze the spatial and temporal distribution of NPP in the northern Hebei Province in 2001, based on the Carnegie-Ames-Stanford Approach (CASA) model. The spatial distributions of the Absorbed Photosynthetically Active Radiation (APAR) of vegetation and the light use efficiency in three geographical subregions, namely the Bashang Plateau Region, the Basin Region in northwestern Hebei Province, and the Yanshan Mountainous Region in northern Hebei Province, were analyzed, and the total NPP spatial distribution of the study area in 2001 was discussed. Based on the 16-day MODIS Fraction of Photosynthetically Active Radiation absorbed by vegetation (FPAR) product, 16-day composite NPP dynamics were calculated using the CASA model, and the seasonal dynamics of vegetation NPP in the three subregions were also analyzed. The results reveal that the total NPP of the study area in 2001 was 25.1877 × 10^6 gC/(m^2·a), and NPP in 2001 ranged from 2 to 608 gC/(m^2·a), with an average of 337.516 gC/(m^2·a). NPP of the study area in 2001 accumulated mainly from May to September (DOY 129-272), high NPP values appeared from June to August (DOY 177-204), and the maximum NPP appeared from late July to mid-August (DOY 209-224).
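The CASA light-use-efficiency form, NPP = APAR × ε with APAR = FPAR × PAR and PAR taken as roughly half of incident solar radiation, can be sketched per cell and time step. The input values below are hypothetical, not data from the study:

```python
def casa_npp(sol, fpar, epsilon):
    # CASA light-use-efficiency form for one cell and one time step:
    # PAR is roughly half of total solar radiation, APAR = FPAR * PAR,
    # and NPP = APAR * epsilon (actual light use efficiency).
    par = 0.5 * sol
    apar = fpar * par
    return apar * epsilon
```

In the study, FPAR comes from the 16-day MODIS product and ε is the light use efficiency modulated by temperature and moisture stress, so each 16-day composite is one evaluation of this product per cell.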
In this article, we introduce a new three-parameter heavy-tailed lifetime distribution that aims to extend the modeling possibilities of the Lomax distribution. It is called the extended Lomax distribution. The considered distribution naturally appears as the distribution of a transformation of a random variable following the log-weighted power distribution, recently introduced for percentage or proportion data analysis purposes. As a result, its cumulative distribution function has the same functional basis as that of the Lomax distribution, but with a novel special logarithmic term depending on several parameters. The modulation of this logarithmic term reveals new types of asymmetrical shapes, implying a modeling horizon beyond that of the Lomax distribution. In the first part, we examine several of its mathematical properties, such as the shapes of the related probability density and hazard rate functions; stochastic comparisons; manageable expansions for various moments; and quantile properties. In particular, based on the quantile functions, various actuarial measures are discussed. In the second part, the distribution's applicability is investigated using the maximum likelihood estimation method. The behavior of the obtained parameter estimates is validated by a simulation study. Insurance claim data are analyzed. We show that the proposed distribution outperforms eight well-known distributions, including the Lomax distribution and several extended Lomax distributions. In addition, we demonstrate that it gives preferable inferences relative to these competitor distributions in terms of risk measures.
When using healthcare data, it is crucial to weigh the advantages of data privacy against the possible drawbacks. Data from several sources must be combined for use in many data mining applications. Medical practitioners may use the results of association rule mining performed on this aggregated data to better personalize patient care and implement preventive measures. Historically, numerous heuristic (e.g., greedy search) and metaheuristic-based techniques (e.g., evolutionary algorithms) have been created for positive association rules in privacy-preserving data mining (PPDM). When it comes to connecting seemingly unrelated diseases and drugs, negative association rules may be more informative than their positive counterparts. It is well known that a large number of uninteresting rules are formed during negative association rule mining, making this a difficult problem to tackle. In this research, we offer an adaptive method for negative association rule mining in vertically partitioned healthcare datasets that respects users' privacy. The applied approach dynamically determines the transactions to be interrupted for information hiding, as opposed to predefining them. This study introduces a novel method, based on the Tabu-genetic optimization paradigm, for addressing the problem of negative association rules in healthcare data mining. Tabu search is advantageous because it removes a huge number of unnecessary rules and itemsets. Experiments using benchmark healthcare datasets prove that the discussed scheme outperforms state-of-the-art solutions in terms of decreasing side effects and data distortions, as measured by the hiding-failure indicator.
A new method of establishing a rolling load distribution model was developed using online intelligent information-processing technology for plate rolling. The model combines a knowledge model and a mathematical model, starting from knowledge discovery in databases (KDD) and data mining (DM). Online maintenance and optimization of the load model are realized. The effectiveness of this new method was verified by offline simulation and online application.
In the data retrieval process of a data recommendation system, matching prediction and similarity identification play a major role in the ontology. There are several methods to improve the retrieval process with improved accuracy and to reduce the search time. However, in a data recommendation system, this type of data search becomes complex when seeking the best match for given query data, and the accuracy of the query recommendation process suffers. To improve the performance of data validation, this paper proposes a novel model of data similarity estimation and clustering to retrieve the relevant data with the best match in big data processing. In this paper, an advanced model of the Logarithmic Directionality Texture Pattern (LDTP) method with a Metaheuristic Pattern Searching (MPS) system is used to estimate the similarity between the query data and the entire database. The overall work is implemented for the data recommendation process. The records are indexed and grouped as clusters to form a paged database structure, which reduces computation time during searching. Also, with the help of a neural network, the relevance of feature attributes in the database is predicted, and the matching index is sorted to provide the recommended data for the given query data. This is achieved using a Distributional Recurrent Neural Network (DRNN), an enhanced neural network model that finds relevance based on the correlation factor of the feature set. The training process of the DRNN classifier is carried out by estimating the correlation factor of the attributes of the dataset. These are formed as clusters and paged with proper indexing based on the MPS parameter of the similarity metric. The overall performance of the proposed work is evaluated by varying the size of the training database among 60%, 70%, and 80%. The parameters considered for performance analysis are precision, recall, F1-score, the accuracy of data retrieval, the query recommendation output, and comparison with other state-of-the-art methods.
Considering that the measurement devices of the distribution network are becoming more and more abundant, phasor measurement unit (PMU) devices are gradually being applied to the distribution network on top of the traditional Supervisory Control And Data Acquisition (SCADA) measurement system. Thus, both kinds of devices must be used when estimating the state of the distribution network. However, because the data of the different measurement systems differ, it is necessary to balance these differences so that the data of the different systems become compatible and the estimated distribution state can be used effectively. To this end, this paper starts by eliminating the differences between the system data in three respects: the data accuracy of the two measurement systems, the data time section, and the data refresh frequency. It then considers the actual three-phase asymmetry of the distribution network: the three-phase state estimation equations are constructed by the branch current method, and finally the state estimation results are solved by the weighted least squares method.
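The weighted least squares step can be illustrated in its simplest case: a single state measured directly by both systems, where each weight is the inverse of the measurement variance, so the more accurate PMU reading dominates. The accuracy figures below are illustrative assumptions, not values from the paper:

```python
def wls_single_state(readings, variances):
    # WLS for a single directly measured state: weights are inverse
    # variances, so the estimate is the precision-weighted mean.
    weights = [1.0 / v for v in variances]
    return sum(w * z for w, z in zip(weights, readings)) / sum(weights)

scada_reading, scada_var = 1.02, 0.02 ** 2   # hypothetical SCADA accuracy
pmu_reading, pmu_var = 1.00, 0.002 ** 2      # hypothetical PMU accuracy
estimate = wls_single_state([scada_reading, pmu_reading], [scada_var, pmu_var])
```

This weighting is exactly how balancing the accuracy difference between the two systems plays out in the estimator: each measurement contributes in proportion to its precision.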
The Type-I censoring mechanism arises when the number of units experiencing the event is random but the total duration of the study is fixed. A number of mathematical approaches have been developed to handle this type of data. The purpose of this research was to estimate the three parameters of the Frechet distribution via the frequentist maximum likelihood and Bayesian estimators. Closed-form maximum likelihood estimates (MLEs) of the three parameters are not available; therefore, they were obtained by numerical methods. Similarly, the Bayesian estimators are implemented using Jeffreys and gamma priors with two loss functions: the squared error loss function and the Linear Exponential Loss Function (LINEX). The parameters of the Frechet distribution cannot be obtained analytically via the Bayesian approach, and therefore Markov Chain Monte Carlo is used, where the full conditional distributions of the three parameters are sampled via the Metropolis-Hastings algorithm. The estimators are compared using the Mean Square Error (MSE) to determine the best estimator of the three parameters of the Frechet distribution. The results show that the Bayesian estimation under the Linear Exponential Loss Function based on Type-I censored data gives better parameter estimates when the value of the loss parameter is positive.
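The Bayes estimator under LINEX loss has a closed form in terms of the posterior: d* = -(1/a) * ln E[exp(-a*theta)], which can be estimated directly from MCMC draws. A minimal sketch (the draws below are placeholders, not posterior samples from the paper):

```python
import math

def linex_bayes_estimate(posterior_draws, a):
    # Bayes estimator under LINEX loss L(d, t) = exp(a*(d-t)) - a*(d-t) - 1:
    # d* = -(1/a) * log E[exp(-a * theta)], with the posterior expectation
    # approximated by the average over Metropolis-Hastings draws.
    m = sum(math.exp(-a * t) for t in posterior_draws) / len(posterior_draws)
    return -math.log(m) / a
```

As a approaches 0 this tends to the posterior mean, while a positive a penalizes overestimation more heavily, which matches the abstract's finding that positive loss-parameter values performed best.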
Funding: Supported by the National Natural Science Foundation of China (No. 42071057).
Abstract: The Qilian Mountains, a national key ecological function zone in Western China, play a pivotal role in ecosystem services. However, the distribution of its dominant tree species, Picea crassifolia (Qinghai spruce), has decreased dramatically in recent decades due to climate change and human activity, which may have impaired its ecological functions. Reasonable reforestation is the key measure for restoring these functions. Many previous efforts have predicted the potential distribution of Picea crassifolia, providing guidance for regional reforestation policy. However, all of them were performed at low spatial resolution, ignoring the naturally patchy distribution of Picea crassifolia. Here, we modeled the distribution of Picea crassifolia with species distribution models at high spatial resolutions. For many models, the area under the receiver operating characteristic curve (AUC) exceeds 0.9, indicating excellent precision. The AUC of models at 30 m is higher than that of models at 90 m, and the current potential distribution of Picea crassifolia aligns more closely with its actual distribution at 30 m, demonstrating that finer data resolution improves model performance. Moreover, at 90 m resolution, annual precipitation (Bio12) had the greatest influence on the distribution of Picea crassifolia, whereas aspect became the most important variable at 30 m, indicating the crucial role of finer topographic data in modeling species with patchy distributions. The current distribution of Picea crassifolia is concentrated in the northern and central parts of the study area, and this pattern will be maintained under future scenarios, although some habitat loss in the central parts and gains in the eastern regions are expected owing to increasing temperature and precipitation.
Our findings can guide protective and restoration strategies for the Qilian Mountains, which would benefit regional ecological balance.
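The abstract evaluates the species distribution models by AUC. As a library-free illustration of how this metric is computed (not the authors' SDM pipeline; the toy presence/absence labels and scores below are hypothetical), AUC can be obtained from the rank-sum formulation:

```python
def auc(labels, scores):
    """Rank-based AUC: the probability that a random positive example
    is scored above a random negative one (ties ignored for brevity)."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sum of 1-based ranks of the positive examples.
    rank_sum = sum(rank for rank, (_, y) in enumerate(pairs, start=1) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Perfectly separated toy presences/absences give AUC = 1.0
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.6, 0.9]))  # 1.0
```

An AUC above 0.9, as reported for many of the paper's models, means a randomly chosen presence site outranks a randomly chosen absence site more than 90% of the time.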
Funding: This work was supported by the National Key R&D Program of China under Grant 2023YFB2703802 and the Hunan Province Innovation and Entrepreneurship Training Program for College Students under Grant S202311528073.
Abstract: Sharing data while protecting privacy in the industrial Internet is a significant challenge. Traditional machine learning methods require combining all data for training; however, this approach can be limited by data availability and privacy concerns. Federated learning (FL) has gained considerable attention because it allows for decentralized training on multiple local datasets. However, the training data collected by data providers are often non-independent and identically distributed (non-IID), resulting in poor FL performance. This paper proposes a privacy-preserving approach for sharing non-IID data in the industrial Internet using an FL approach based on blockchain technology. To overcome the problem of non-IID data leading to poor training accuracy, we propose dynamically updating the local model based on the divergence between the global and local models. This approach can significantly improve the accuracy of FL training when dispersion is relatively large. In addition, we design a dynamic gradient clipping algorithm to alleviate the influence of noise on model accuracy and reduce the potential privacy leakage caused by sharing model parameters. Finally, we evaluate the performance of the proposed scheme using commonly used open-source image datasets. The simulation results demonstrate that our method can significantly enhance accuracy while protecting privacy and maintaining efficiency, thereby providing a new solution to the data-sharing and privacy-protection challenges in the industrial Internet.
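The dynamic gradient clipping mentioned above can be sketched as ordinary norm clipping with a per-round threshold. This is a minimal illustration, not the paper's algorithm: the decaying schedule and all numbers below are assumptions.

```python
import math

def clip_gradient(grad, max_norm):
    """Scale the gradient down if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return grad

def dynamic_threshold(base, round_idx, decay=0.9):
    """Hypothetical per-round decaying clip threshold."""
    return base * (decay ** round_idx)

# A gradient of norm 5 clipped to unit norm in round 0
g = clip_gradient([3.0, 4.0], dynamic_threshold(1.0, 0))
print([round(v, 6) for v in g])  # [0.6, 0.8]
```

Clipping bounds each client's update magnitude, which limits both the influence of noisy gradients and the information any single shared update can leak.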
Funding: Supported in part by the "Pioneer" and "Leading Goose" R&D Program of Zhejiang (Grant No. 2022C03174); the National Natural Science Foundation of China (No. 92067103); the Key Research and Development Program of Shaanxi, China (No. 2021ZDLGY06-02); the Natural Science Foundation of Shaanxi Province (No. 2019ZDLGY12-02); the Shaanxi Innovation Team Project (No. 2018TD-007); the Xi'an Science and Technology Innovation Plan (No. 201809168CX9JC10); the Fundamental Research Funds for the Central Universities (No. YJS2212); and the National 111 Program of China (B16037).
Abstract: The security of Federated Learning (FL) / Distributed Machine Learning (DML) is gravely threatened by data poisoning attacks, which destroy the usability of the model by contaminating training samples; such attacks are therefore called causative availability indiscriminate attacks. Because existing data sanitization methods are hard to apply to real-time applications due to their tedious process and heavy computations, we propose a new supervised batch detection method for poison, which can rapidly sanitize the training dataset before local model training. We design a training dataset generation method that helps to enhance accuracy and uses data complexity features to train a detection model, which is then used in an efficient batch hierarchical detection process. Our model stockpiles knowledge about poison, which can be expanded by retraining to adapt to new attacks. Being neither attack-specific nor scenario-specific, our method is applicable to FL/DML as well as other online or offline scenarios.
Funding: Sponsored by the National Natural Science Foundation of China (Nos. 61972208, 62102194 and 62102196); the National Natural Science Foundation of China (Youth Project) (No. 62302237); the Six Talent Peaks Project of Jiangsu Province (No. RJFW-111); the China Postdoctoral Science Foundation Project (No. 2018M640509); the Postgraduate Research and Practice Innovation Program of Jiangsu Province (Nos. KYCX22_1019, KYCX23_1087, KYCX22_1027, SJCX24_0339 and SJCX24_0346); the Innovative Training Program for College Students of Nanjing University of Posts and Telecommunications (No. XZD2019116); and the Nanjing University of Posts and Telecommunications College Students Innovation Training Program (Nos. XZD2019116, XYB2019331).
Abstract: The scale and complexity of big data are growing continuously, posing severe challenges to traditional data processing methods, especially in the field of clustering analysis. To address this issue, this paper introduces a new method named Big Data Tensor Multi-Cluster Distributed Incremental Update (BDTMCDIncreUpdate), which combines distributed computing, storage technology, and incremental update techniques to provide an efficient and effective means for clustering analysis. Firstly, the original dataset is divided into multiple sub-blocks, and distributed computing resources are utilized to process the sub-blocks in parallel, enhancing efficiency. Then, initial clustering is performed on each sub-block using tensor-based multi-clustering techniques to obtain preliminary results. When new data arrive, incremental update technology is employed to update the core tensor and factor matrix, ensuring that the clustering model can adapt to changes in the data. Finally, by combining the updated core tensor and factor matrix with historical computational results, refined clustering results are obtained, achieving real-time adaptation to dynamic data. Through experimental simulation on the Aminer dataset, the BDTMCDIncreUpdate method has demonstrated outstanding performance in terms of accuracy (ACC) and normalized mutual information (NMI) metrics, achieving an accuracy rate of 90% and an NMI score of 0.85, which outperforms existing methods such as TClusInitUpdate and TKLClusUpdate in most scenarios. Therefore, the BDTMCDIncreUpdate method offers an innovative solution for big data analysis, integrating distributed computing, incremental updates, and tensor-based multi-clustering techniques. It not only improves efficiency and scalability in processing large-scale, high-dimensional datasets but has also been validated for effectiveness and accuracy through experiments. This method shows great potential in real-world applications where dynamic data growth is common, and it is of significant importance for advancing the development of data analysis technology.
Funding: Supported by the National Natural Science Foundation of China (Nos. 62271165, 62027802, 62201307); the Guangdong Basic and Applied Basic Research Foundation (No. 2023A1515030297); the Shenzhen Science and Technology Program (ZDSYS20210623091808025); the Stable Support Plan Program (GXWD20231129102638002); and the Major Key Project of PCL (No. PCL2024A01).
Abstract: Due to the restricted satellite payloads in LEO mega-constellation networks (LMCNs), remote sensing image analysis, online learning, and other big data services desirably need onboard distributed processing (OBDP). In existing technologies, the efficiency of big data applications (BDAs) in distributed systems hinges on stable, low-latency links between worker nodes. However, LMCNs with highly dynamic nodes and long-distance links cannot provide these conditions, which makes the performance of OBDP hard to measure intuitively. To bridge this gap, a multidimensional simulation platform is indispensable that can simulate the network environment of LMCNs and place BDAs in it for performance testing. Using STK's APIs and a parallel computing framework, we achieve real-time simulation for thousands of satellite nodes, which are mapped as application nodes through software-defined networking (SDN) and container technologies. We elaborate the architecture and mechanism of the simulation platform and take Starlink and Hadoop as realistic examples for simulations. The results indicate that LMCNs have dynamic end-to-end latency that fluctuates periodically with the constellation movement. Compared with ground data center networks (GDCNs), LMCNs deteriorate computing and storage job throughput, which can be alleviated by the use of erasure codes and data flow scheduling of worker nodes.
Funding: Supported by STI 2030-Major Projects (2021ZD0200400); the National Natural Science Foundation of China (62276233 and 62072405); and the Key Research Project of Zhejiang Province (2023C01048).
Abstract: Multimodal sentiment analysis utilizes multimodal data such as text, facial expressions, and voice to detect people's attitudes. With the advent of distributed data collection and annotation, we can easily obtain and share such multimodal data. However, due to professional discrepancies among annotators and lax quality control, noisy labels might be introduced. Recent research suggests that deep neural networks (DNNs) will overfit noisy labels, leading to poor DNN performance. To address this challenging problem, we present a Multimodal Robust Meta Learning framework (MRML) for multimodal sentiment analysis that resists noisy labels and correlates distinct modalities simultaneously. Specifically, we propose a two-layer fusion net to deeply fuse different modalities and improve the quality of the multimodal data features for label correction and network training. Besides, a multiple meta-learner (label corrector) strategy is proposed to enhance the label correction approach and prevent models from overfitting to noisy labels. We conducted experiments on three popular multimodal datasets to verify the superiority of our method by comparing it with four baselines.
Funding: Supported by the Science and Technology Project of China Southern Power Grid (GZHKJXM20210043-080041KK52210002).
Abstract: Traditional distribution network planning relies on the professional knowledge of planners, especially when analyzing the correlations between the problems existing in the network and the crucial influencing factors. The inherent laws reflected by the historical data of the distribution network are ignored, which affects the objectivity of the planning scheme. In this study, to improve the efficiency and accuracy of distribution network planning, the characteristics of distribution network data were extracted using a data-mining technique, and correlation knowledge of existing problems in the network was obtained. A data-mining model based on correlation rules was established. The inputs of the model were the electrical characteristic indices screened using the gray correlation method. The Apriori algorithm was used to extract correlation knowledge from the operational data of the distribution network and obtain strong correlation rules. Lift (degree of promotion) and chi-square tests were used to verify the rationality of the strong correlation rules output by the model. In this study, the correlation relationship between heavy-load or overload problems of distribution network feeders in different regions and related characteristic indices was determined, and the confidence of the correlation rules was obtained. These results can provide an effective basis for the formulation of a distribution network planning scheme.
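The rule-quality measures named above (support, confidence, and lift) are standard in Apriori-style mining and easy to state concretely. A minimal sketch with hypothetical feeder-status items (the item names are illustrative, not from the paper):

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    a = sum(1 for t in transactions if antecedent <= t)
    b = sum(1 for t in transactions if consequent <= t)
    both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = both / n
    confidence = both / a
    lift = confidence / (b / n)   # lift > 1 indicates positive correlation
    return support, confidence, lift

# Hypothetical operating records of four feeders
tx = [{"heavy_load", "high_pf"}, {"heavy_load", "high_pf"},
      {"heavy_load"}, {"low_pf"}]
print(rule_metrics(tx, {"heavy_load"}, {"high_pf"}))  # (0.5, 0.666..., 1.333...)
```

A rule is "strong" when its support and confidence clear user-set thresholds; the lift check then filters out rules whose confidence merely reflects a frequent consequent.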
Abstract: The fitting of lifetime distributions to real-life data has been studied in various fields of research. As increasingly complex data from real-world scenarios continue to emerge, many researchers have made commendable efforts to develop new lifetime distributions that can fit such data. In this paper, we utilized the KM-transformation technique to increase the flexibility of the power Lindley distribution, resulting in the Kavya-Manoharan Power Lindley (KMPL) distribution. We study the mathematical treatment of the KMPL distribution in detail and adapt the widely used method of maximum likelihood to estimate its unknown parameters. We carry out a Monte Carlo simulation study to investigate the performance of the Maximum Likelihood Estimates (MLEs) of the parameters of the KMPL distribution. To demonstrate the effectiveness of the KMPL distribution for data fitting, we use a real dataset comprising the waiting times of 100 bank customers. We compare the KMPL distribution with other models that are extensions of the power Lindley distribution. Based on several statistical model selection criteria, the summary results of the analysis were in favor of the KMPL distribution. We further investigate the density fit and probability-probability (p-p) plots to validate the superiority of the KMPL distribution over the competing distributions for fitting the waiting time dataset.
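The construction can be sketched numerically. Assuming the standard Kavya-Manoharan transform G(x) = (e/(e-1))(1 - e^(-F(x))) applied to the usual power Lindley CDF F(x) = 1 - (1 + bx^a/(b+1))e^(-bx^a) (both forms as commonly given in the literature; parameter values below are arbitrary):

```python
import math

def power_lindley_cdf(x, alpha, beta):
    """Power Lindley CDF in the common (alpha, beta) parameterization."""
    t = beta * x ** alpha
    return 1.0 - (1.0 + t / (beta + 1.0)) * math.exp(-t)

def kmpl_cdf(x, alpha, beta):
    """Kavya-Manoharan transform of the power Lindley CDF."""
    e = math.e
    return (e / (e - 1.0)) * (1.0 - math.exp(-power_lindley_cdf(x, alpha, beta)))

print(kmpl_cdf(0.0, 2.0, 1.5))   # 0.0 at the lower bound
print(kmpl_cdf(50.0, 2.0, 1.5))  # approaches 1.0 in the upper tail
```

Note the transform preserves the support and adds no new parameters, which is why the KMPL keeps the power Lindley's parameter count while gaining flexibility in shape.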
Funding: Supported by the National Key R&D Program of China (2020YFB0905900).
Abstract: Operation control of power systems has become challenging with the increase in the scale and complexity of power distribution systems and extensive access to renewable energy. Therefore, improvement of the ability of data-driven operation management, intelligent analysis, and mining is urgently required. To investigate and explore similar regularities of the historical operating sections of the power distribution system, and to help the power grid systematically obtain high-value historical operation and maintenance experience and knowledge, a neural information retrieval model with an attention mechanism is proposed based on graph data computing technology. Based on the processing flow of the operating data of the power distribution system, a technical framework for neural information retrieval is established. Combined with the natural graph characteristics of the power distribution system, a unified graph data structure and a data fusion method covering data access, data complement, and multi-source data are constructed. Further, a graph node feature-embedding representation learning algorithm and a neural information retrieval algorithm model are constructed. The neural information retrieval model is trained and tested using the generated set of graph node feature representation vectors, and verified on the operating sections of the power distribution system of a provincial grid area. The results show that the proposed method demonstrates high accuracy in the similarity matching of historical operation characteristics and effectively supports intelligent fault diagnosis and elimination in power distribution systems.
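The retrieval step over learned embeddings reduces to nearest-neighbor search in vector space. A minimal cosine-similarity sketch (the section names and 3-d embeddings are hypothetical stand-ins for the paper's learned graph node vectors):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_vec, section_vecs, top_k=2):
    """Rank stored operating-section embeddings by similarity to the query."""
    ranked = sorted(section_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Hypothetical embeddings for three historical operating sections
sections = {"sec_A": [1.0, 0.0, 0.1],
            "sec_B": [0.9, 0.1, 0.0],
            "sec_C": [0.0, 1.0, 0.9]}
print(retrieve([1.0, 0.0, 0.0], sections))  # ['sec_A', 'sec_B']
```

In the paper's setting, a current operating section is embedded the same way and its most similar historical sections are returned, carrying their associated maintenance records.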
Abstract: Distribution networks are important public infrastructure necessary for people's livelihoods. However, extreme natural disasters, such as earthquakes, typhoons, and mudslides, severely threaten the safe and stable operation of distribution networks and the power supplies needed for daily life. Therefore, considering the requirements for distribution network disaster prevention and mitigation, there is an urgent need for in-depth research on risk assessment methods for distribution networks under extreme natural disaster conditions. This paper accesses multisource data, presents data quality improvement methods for distribution networks, and conducts data-driven active fault diagnosis and disaster damage analysis and evaluation. Furthermore, the paper realizes real-time, accurate access to distribution network disaster information. A case study shows that the proposed approach performs an accurate and rapid assessment of cross-sectional risk, and that the minimal average annual outage time can be reduced to 3 h/a in the ring network. The approach proposed in this paper can provide technical support for further improving the ability of distribution networks to cope with extreme natural disasters.
Abstract: In several fields, such as financial dealing, industry, business, and medicine, Big Data (BD), which is simply a collection of huge amounts of data, has been utilized extensively. However, processing a massive amount of data is highly complicated and time-consuming. Thus, to design a distribution-preserving framework for BD, a novel methodology has been proposed utilizing Manhattan Distance (MD)-centered Partition Around Medoids (MD-PAM) along with a Conjugate Gradient Artificial Neural Network (CG-ANN), which undergoes various steps to reduce the complications of BD. Firstly, the data are processed in the pre-processing phase by mitigating data repetition utilizing the map-reduce function; subsequently, the missing data are handled by substituting or ignoring the missed values. After that, the data are transmuted into a normalized form. Next, to enhance the classification performance, the data's dimensionalities are minimized by employing Gaussian Kernel Fisher Discriminant Analysis (GK-FDA). Afterwards, the processed data are submitted to the partitioning phase after being transmuted into a structured format. In the partition phase, the data are partitioned and grouped into clusters utilizing MD-PAM. Lastly, the data are classified in the classification phase by employing CG-ANN, so that the needed data can be effortlessly retrieved by the user. To compare the outcomes of the CG-ANN with prevailing methodologies, the openly accessible NSL-KDD datasets are utilized. The experimental results showed that the proposed CG-ANN achieves efficient results with reduced computation cost and outperforms existing systems in terms of accuracy, sensitivity, and specificity.
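The core of the MD-PAM partition phase is assigning points to their nearest medoids under Manhattan (L1) distance. A minimal sketch of that assignment step only (the points and medoids are toy values; full PAM additionally swaps medoids to minimize total cost):

```python
def manhattan(p, q):
    """L1 (Manhattan) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def assign_to_medoids(points, medoids):
    """Assign each point to the index of its nearest medoid under L1 distance."""
    return [min(range(len(medoids)), key=lambda i: manhattan(p, medoids[i]))
            for p in points]

pts = [(0, 0), (1, 1), (9, 9), (10, 8)]
print(assign_to_medoids(pts, [(0, 0), (10, 10)]))  # [0, 0, 1, 1]
```

Using medoids (actual data points) rather than means makes the clustering robust to outliers, and the L1 metric avoids squaring large coordinate differences.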
Funding: The National Key Basic Research Program of China (973 Program).
Abstract: To improve data distribution efficiency, a load-balancing data distribution (LBDD) method is proposed for the publish/subscribe mode. In the LBDD method, subscribers are involved in distribution tasks and data transfers while receiving data themselves. A dissemination tree is constructed among the subscribers based on MD5, where the publisher acts as the root. The proposed method provides bucket construction, target selection, and path updates; furthermore, the property of one-way dissemination is proven. The proposed LBDD guarantees that the average out-going degree of a node is 2. Experiments on data distribution delay, data distribution rate, and load distribution are conducted. Experimental results show that the LBDD method helps to share the task load between the publisher and subscribers and outperforms the point-to-point approach.
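The tree-building idea can be sketched as follows: order subscribers by their MD5 digests and lay them out as a binary tree rooted at the publisher, so every node forwards to at most two children. This is only an illustration of the MD5-ordering and out-degree-2 layout; the paper's actual bucket-construction and path-update rules are not reproduced here, and the node names are hypothetical.

```python
import hashlib

def build_dissemination_tree(publisher, subscribers):
    """Order subscribers by MD5 digest, then lay them out as a binary tree
    rooted at the publisher (heap-style indexing: children of node i are
    2i+1 and 2i+2), so each node forwards to at most two others."""
    ordered = sorted(subscribers,
                     key=lambda s: hashlib.md5(s.encode()).hexdigest())
    nodes = [publisher] + ordered
    children = {n: [] for n in nodes}
    for i, n in enumerate(nodes):
        for c in (2 * i + 1, 2 * i + 2):
            if c < len(nodes):
                children[n].append(nodes[c])
    return children

tree = build_dissemination_tree("pub", ["s1", "s2", "s3", "s4"])
print(all(len(kids) <= 2 for kids in tree.values()))  # True: out-degree <= 2
```

Spreading forwarding work over the subscribers is what relieves the publisher of the O(n) fan-out cost of the point-to-point approach.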
Funding: The National Natural Science Foundation of China (Nos. 40830959 and 40921004) and the Ministry of Science and Technology of China (No. 2011BAC03B01).
Abstract: The existence of three well-defined tongue-shaped zones of swell dominance, termed 'swell pools', in the Pacific, the Atlantic, and the Indian Oceans was reported by Chen et al. (2002) using satellite data. In this paper, the ECMWF re-analysis wind wave data, including wind speed, significant wave height, and averaged wave period and direction, are applied to verify the existence of these swell pools. Swell indices calculated from wave height, wave age, and the correlation coefficient are used to identify swell events. The wave age swell index can be more appropriately related to physical processes than the other two swell indices. Based on the ECMWF data, the swell pools in the Pacific and the Atlantic Oceans are confirmed, but the expected swell pool in the Indian Ocean is not pronounced. The seasonal variations of global and hemispherical swell indices are investigated, and the argument that swells in the pools seem to originate mostly from the winter hemisphere is supported by the seasonal variation of the averaged wave direction. The northward bending of the swell pools in the Pacific and the Atlantic Oceans in summer is not revealed by the ECMWF data. The swell pool in the Indian Ocean and the summer northward bending of the swell pools in the Pacific and the Atlantic Oceans need to be further verified with other datasets.
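The wave-age criterion mentioned above can be made concrete. Assuming deep-water waves, the phase speed is c_p = gT/(2π), and wave age is c_p/U10; a commonly used convention (an assumption here, not necessarily the paper's exact threshold) classifies waves outrunning the wind, wave age above about 1.2, as swell:

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def wave_age(period_s, wind_speed_ms):
    """Deep-water wave age c_p / U10, with phase speed c_p = g*T / (2*pi)."""
    phase_speed = G * period_s / (2.0 * math.pi)
    return phase_speed / wind_speed_ms

def is_swell(period_s, wind_speed_ms, threshold=1.2):
    """Waves travelling faster than ~1.2x the wind are classified as swell."""
    return wave_age(period_s, wind_speed_ms) > threshold

print(is_swell(period_s=12.0, wind_speed_ms=5.0))   # True: long waves, light wind
print(is_swell(period_s=4.0, wind_speed_ms=15.0))   # False: short wind sea
```

This is why the wave-age index ties most directly to physics: it compares wave propagation speed against the local wind forcing, rather than relying on wave height statistics alone.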
Funding: Under the auspices of the National Natural Science Foundation of China (No. 40571117); the Knowledge Innovation Program of the Chinese Academy of Sciences (No. KZCX3-SW-338); and the Research Foundation of the State Key Laboratory of Remote Sensing Science, Institute of Remote Sensing Applications, Chinese Academy of Sciences (KQ060006).
Abstract: Net Primary Productivity (NPP) is one of the important biophysical variables of vegetation activity, and it plays an important role in studying the global carbon cycle, carbon sources and sinks of ecosystems, and the spatial and temporal distribution of CO2. Remote sensing can provide a broad view quickly, timely, and multi-temporally, which makes it an attractive and powerful tool for studying ecosystem primary productivity at scales ranging from local to global. This paper uses Moderate Resolution Imaging Spectroradiometer (MODIS) data to estimate and analyze the spatial and temporal distribution of NPP of the northern Hebei Province in 2001 based on the Carnegie-Ames-Stanford Approach (CASA) model. The spatial distributions of Absorbed Photosynthetically Active Radiation (APAR) of vegetation and light use efficiency in three geographical subregions, namely the Bashang Plateau Region, the Basin Region in the northwestern Hebei Province, and the Yanshan Mountainous Region in the northern Hebei Province, were analyzed, and the total NPP spatial distribution of the study area in 2001 was discussed. Based on the 16-day MODIS Fraction of Photosynthetically Active Radiation absorbed by vegetation (FPAR) product, 16-day composite NPP dynamics were calculated using the CASA model; the seasonal dynamics of vegetation NPP in the three subregions were also analyzed. Results reveal that the total NPP of the study area in 2001 was 25.1877 × 10^6 gC/(m^2·a), and NPP in 2001 ranged from 2 to 608 gC/(m^2·a), with an average of 337.516 gC/(m^2·a). NPP of the study area in 2001 accumulated mainly from May to September (DOY 129-272), high NPP values appeared from June to August (DOY 177-204), and the maximum NPP appeared from late July to mid-August (DOY 209-224).
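The CASA calculation reduces to NPP = APAR × ε, with APAR = FPAR × PAR and the light-use efficiency ε equal to a maximum value scaled down by temperature and water stress terms. A simplified sketch, assuming the commonly cited CASA global maximum efficiency of 0.389 gC/MJ and unit stress scalars (the paper's per-pixel scalars differ; the FPAR/PAR inputs below are hypothetical):

```python
def casa_npp(fpar, par, eps_max=0.389, t_scalar=1.0, w_scalar=1.0):
    """Simplified CASA: NPP = APAR * light-use efficiency.
    APAR = FPAR * PAR; efficiency = eps_max reduced by temperature and
    water stress scalars (both set to 1.0 here for illustration)."""
    apar = fpar * par
    return apar * eps_max * t_scalar * w_scalar

# Hypothetical 16-day composite: FPAR = 0.6, incident PAR = 200 MJ/m^2
print(casa_npp(0.6, 200.0))  # gC/m^2 accumulated over the period
```

Summing such 16-day composites over DOY 129-272 is what produces the May-September accumulation pattern described in the abstract.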
Funding: Funded by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, under Grant No. KEP-PhD:21-130-1443.
Abstract: In this article, we highlight a new three-parameter heavy-tailed lifetime distribution that aims to extend the modeling possibilities of the Lomax distribution. It is called the extended Lomax distribution. The considered distribution naturally appears as the distribution of a transformation of a random variable following the log-weighted power distribution recently introduced for percentage or proportion data analysis purposes. As a result, its cumulative distribution function has the same functional basis as that of the Lomax distribution, but with a novel special logarithmic term depending on several parameters. The modulation of this logarithmic term reveals new types of asymmetrical shapes, implying a modeling horizon beyond that of the Lomax distribution. In the first part, we examine several of its mathematical properties, such as the shapes of the related probability and hazard rate functions; stochastic comparisons; manageable expansions for various moments; and quantile properties. In particular, based on the quantile functions, various actuarial measures are discussed. In the second part, the distribution's applicability is investigated using the maximum likelihood estimation method. The behavior of the obtained parameter estimates is validated by a simulation study. Insurance claim data are analyzed. We show that the proposed distribution outperforms eight well-known distributions, including the Lomax distribution and several extended Lomax distributions. In addition, we demonstrate that it gives preferable inferences compared with these competitor distributions in terms of risk measures.
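The baseline being extended is the standard Lomax distribution, with CDF F(x) = 1 - (1 + x/λ)^(-α); its closed-form quantile function is what makes quantile-based actuarial measures (e.g. value-at-risk) tractable. A minimal sketch of the baseline only (the paper's extra logarithmic term is not reproduced here):

```python
def lomax_cdf(x, alpha, lam):
    """Lomax CDF: F(x) = 1 - (1 + x/lam)^(-alpha), for x >= 0."""
    return 1.0 - (1.0 + x / lam) ** (-alpha)

def lomax_quantile(p, alpha, lam):
    """Inverse CDF, used e.g. for value-at-risk style actuarial measures."""
    return lam * ((1.0 - p) ** (-1.0 / alpha) - 1.0)

# Round-trip check: the 95% quantile maps back to probability 0.95
q = lomax_quantile(0.95, alpha=2.0, lam=1.0)
print(abs(lomax_cdf(q, 2.0, 1.0) - 0.95) < 1e-9)  # True
```

The extended version modulates this same functional basis with a logarithmic term, which is what produces the additional asymmetrical shapes described above.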
Abstract: While using healthcare data, it is crucial to assess the advantages of data privacy against the possible drawbacks. Data from several sources must be combined for use in many data mining applications. The medical practitioner may use the results of association rule mining performed on this aggregated data to better personalize patient care and implement preventive measures. Historically, numerous heuristics (e.g., greedy search) and metaheuristic-based techniques (e.g., evolutionary algorithms) have been created for positive association rules in privacy-preserving data mining (PPDM). When it comes to connecting seemingly unrelated diseases and drugs, negative association rules may be more informative than their positive counterparts. It is well known that a large number of uninteresting rules are formed during negative association rule mining, making this a difficult problem to tackle. In this research, we offer an adaptive method for negative association rule mining in vertically partitioned healthcare datasets that respects users' privacy. The applied approach dynamically determines the transactions to be interrupted for information hiding, as opposed to predefining them. This study introduces a novel method for addressing the problem of negative association rules in healthcare data mining, one that is based on the Tabu-genetic optimization paradigm. Tabu search is advantageous since it removes a huge number of unnecessary rules and itemsets. Experiments using benchmark healthcare datasets prove that the discussed scheme outperforms state-of-the-art solutions in terms of decreasing side effects and data distortions, as measured by the hiding failure indicator.
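Negative rules of the form A → ¬B have standard support/confidence definitions derived from the positive counts: supp(A, ¬B) = supp(A) − supp(A, B). A minimal sketch (the drug/disease item names are hypothetical, not from the paper's datasets):

```python
def negative_rule(transactions, a, b):
    """Support and confidence of the negative rule a -> NOT b:
    supp(a, not b) = supp(a) - supp(a and b); conf = supp(a, not b) / supp(a)."""
    n = len(transactions)
    supp_a = sum(1 for t in transactions if a <= t) / n
    supp_ab = sum(1 for t in transactions if (a | b) <= t) / n
    supp_a_notb = supp_a - supp_ab
    return supp_a_notb, supp_a_notb / supp_a

# Hypothetical patient records as item sets
tx = [{"drug_x", "disease_y"}, {"drug_x"}, {"drug_x"}, {"disease_y"}]
print(negative_rule(tx, {"drug_x"}, {"disease_y"}))  # (0.5, 0.666...)
```

Because every frequent antecedent pairs with the absence of every other item, the rule space explodes; this is the flood of uninteresting candidates that the Tabu-genetic search is meant to prune.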
Abstract: A new method of establishing a rolling load distribution model was developed using online intelligent information-processing technology for plate rolling. The model combines a knowledge model and a mathematical model, starting from knowledge discovery in databases (KDD) and data mining (DM). Online maintenance and optimization of the load model are realized. The effectiveness of this new method was verified by offline simulation and online application.
Abstract: In the data retrieval process of a data recommendation system, matching prediction and similarity identification play a major role in the ontology. There are several methods to improve the retrieval process with better accuracy and reduced search time. However, in a data recommendation system, this type of search becomes complex when looking for the best match for given query data, and the accuracy of the query recommendation process suffers. To improve the performance of data validation, this paper proposes a novel model of data similarity estimation and clustering to retrieve the relevant data with the best matching in big data processing. An advanced model of the Logarithmic Directionality Texture Pattern (LDTP) method with a Metaheuristic Pattern Searching (MPS) system was used to estimate the similarity between the query data and the entire database. The overall work was implemented for the data recommendation process. The records are indexed and grouped into clusters to form a paged database structure, which reduces the computation time during searching. Also, with the help of a neural network, the relevance of feature attributes in the database is predicted, and the matching index is sorted to provide the recommended data for the given query data. This was achieved using a Distributional Recurrent Neural Network (DRNN), an enhanced neural network model that finds relevance based on the correlation factor of the feature set. The training process of the DRNN classifier was carried out by estimating the correlation factors of the attributes of the dataset. These are formed as clusters and paged with proper indexing based on the MPS parameter of the similarity metric. The overall performance of the proposed work was evaluated by varying the size of the training database by 60%, 70%, and 80%. The parameters considered for performance analysis are precision, recall, F1-score, and the accuracy of data retrieval, along with the query recommendation output and comparison with other state-of-the-art methods.
Abstract: Considering that the measurement devices of the distribution network are becoming increasingly abundant, phasor measurement unit (PMU) devices are gradually being applied to the distribution network on the basis of the traditional Supervisory Control And Data Acquisition (SCADA) measurement system. Therefore, when estimating the state of the distribution network, both types of devices need to be used. However, because the data of the two measurement systems differ, it is necessary to balance these differences so that data from the different systems are compatible, achieving effective utilization in distribution state estimation. To this end, this paper starts with three aspects, the data accuracy of the two measurement systems, the data time section, and the data refresh frequency, to eliminate the differences between the systems' data, and then considers the actual three-phase asymmetry of the distribution network. The three-phase state estimation equations are constructed by the branch current method, and finally the state estimation results are solved by the weighted least squares method.
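The weighted least squares step has a simple closed form; for a single-state measurement model z_i = h_i·x + e_i it collapses to a weighted average, with weights taken as inverse error variances, which is exactly how the more accurate PMU data dominate the less accurate SCADA data. A minimal sketch (the measurement values and weights below are hypothetical):

```python
def wls_estimate(h, z, w):
    """Weighted least squares for the single-state model z_i = h_i * x + e_i:
    x_hat = sum(w_i * h_i * z_i) / sum(w_i * h_i^2)."""
    num = sum(wi * hi * zi for wi, hi, zi in zip(w, h, z))
    den = sum(wi * hi * hi for wi, hi in zip(w, h))
    return num / den

# A branch-current magnitude measured by SCADA (low weight) and by a PMU
# (high weight); weights are inverse error variances.
x = wls_estimate(h=[1.0, 1.0], z=[10.4, 10.1], w=[1.0, 100.0])
print(round(x, 3))  # 10.103, pulled toward the more accurate PMU reading
```

In the full three-phase problem x, z, and H are vectors and matrices and the estimate is (HᵀWH)⁻¹HᵀWz, but the weighting principle is identical.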
Abstract: The Type-I censoring mechanism arises when the number of units experiencing the event is random but the total duration of the study is fixed. A number of mathematical approaches have been developed to handle this type of data. The purpose of this research was to estimate the three parameters of the Frechet distribution via frequentist maximum likelihood and Bayesian estimators. Closed-form maximum likelihood estimates (MLEs) of the three parameters are not available; therefore, they were obtained by numerical methods. Similarly, the Bayesian estimators are implemented using Jeffreys and gamma priors with two loss functions: the squared error loss function and the Linear Exponential Loss Function (LINEX). The Bayesian estimates of the Frechet parameters cannot be obtained analytically, and therefore Markov Chain Monte Carlo is used, where the full conditional distributions for the three parameters are sampled via the Metropolis-Hastings algorithm. The estimators are compared using Mean Square Errors (MSE) to determine the best estimator of the three parameters of the Frechet distribution. The results show that Bayesian estimation under the Linear Exponential Loss Function based on Type-I censored data is the better estimator for all the parameter estimates when the value of the loss parameter is positive.
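The Metropolis-Hastings step named above is generic and can be sketched independently of the Frechet specifics. Here a toy log-density (a standard normal, up to a constant) stands in for the paper's full conditionals of the Frechet parameters; the proposal scale and seed are arbitrary:

```python
import math
import random

def metropolis_hastings(log_target, x0, steps, scale=1.0, seed=1):
    """Random-walk Metropolis-Hastings: propose x' ~ N(x, scale) and accept
    with probability min(1, target(x') / target(x))."""
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(steps):
        prop = x + rng.gauss(0.0, scale)
        # Accept if log(u) < log target ratio (1e-300 guards against log(0)).
        if math.log(rng.random() + 1e-300) < log_target(prop) - log_target(x):
            x = prop
        chain.append(x)
    return chain

chain = metropolis_hastings(lambda v: -0.5 * v * v, x0=5.0, steps=5000)
mean = sum(chain[1000:]) / len(chain[1000:])
print(abs(mean) < 0.5)  # after burn-in the chain mean settles near 0
```

In the paper's setting, one such update per parameter is cycled within a Gibbs-style scheme, and posterior summaries under the squared-error and LINEX losses are then computed from the retained draws.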