This paper proposes one method of feature selection by using Bayes' theorem. The purpose of the proposed method is to reduce the computational complexity and increase the classification accuracy of the selected featu...This paper proposes one method of feature selection by using Bayes' theorem. The purpose of the proposed method is to reduce the computational complexity and increase the classification accuracy of the selected feature subsets. The dependence between two attributes (binary) is determined based on the probabilities of their joint values that contribute to positive and negative classification decisions. If opposing sets of attribute values do not lead to opposing classification decisions (zero probability), then the two attributes are considered independent of each other, otherwise dependent, and one of them can be removed and thus the number of attributes is reduced. The process must be repeated on all combinations of attributes. The paper also evaluates the approach by comparing it with existing feature selection algorithms over 8 datasets from University of California, Irvine (UCI) machine learning databases. The proposed method shows better results in terms of number of selected features, classification accuracy, and running time than most existing algorithms.展开更多
In the face of a growing number of large-scale data sets, affinity propagation clustering algorithm to calculate the process required to build the similarity matrix, will bring huge storage and computation. Therefore,...In the face of a growing number of large-scale data sets, affinity propagation clustering algorithm to calculate the process required to build the similarity matrix, will bring huge storage and computation. Therefore, this paper proposes an improved affinity propagation clustering algorithm. First, add the subtraction clustering, using the density value of the data points to obtain the point of initial clusters. Then, calculate the similarity distance between the initial cluster points, and reference the idea of semi-supervised clustering, adding pairs restriction information, structure sparse similarity matrix. Finally, the cluster representative points conduct AP clustering until a suitable cluster division.Experimental results show that the algorithm allows the calculation is greatly reduced, the similarity matrix storage capacity is also reduced, and better than the original algorithm on the clustering effect and processing speed.展开更多
Background A task assigned to space exploration satellites involves detecting the physical environment within a certain space.However,space detection data are complex and abstract.These data are not conducive for rese...Background A task assigned to space exploration satellites involves detecting the physical environment within a certain space.However,space detection data are complex and abstract.These data are not conducive for researchers'visual perceptions of the evolution and interaction of events in the space environment.Methods A time-series dynamic data sampling method for large-scale space was proposed for sample detection data in space and time,and the corresponding relationships between data location features and other attribute features were established.A tone-mapping method based on statistical histogram equalization was proposed and applied to the final attribute feature data.The visualization process is optimized for rendering by merging materials,reducing the number of patches,and performing other operations.Results The results of sampling,feature extraction,and uniform visualization of the detection data of complex types,long duration spans,and uneven spatial distributions were obtained.The real-time visualization of large-scale spatial structures using augmented reality devices,particularly low-performance devices,was also investigated.Conclusions The proposed visualization system can reconstruct the three-dimensional structure of a large-scale space,express the structure and changes in the spatial environment using augmented reality,and assist in intuitively discovering spatial environmental events and evolutionary rules.展开更多
Large-scale wireless sensor networks(WSNs)play a critical role in monitoring dangerous scenarios and responding to medical emergencies.However,the inherent instability and error-prone nature of wireless links present ...Large-scale wireless sensor networks(WSNs)play a critical role in monitoring dangerous scenarios and responding to medical emergencies.However,the inherent instability and error-prone nature of wireless links present significant challenges,necessitating efficient data collection and reliable transmission services.This paper addresses the limitations of existing data transmission and recovery protocols by proposing a systematic end-to-end design tailored for medical event-driven cluster-based large-scale WSNs.The primary goal is to enhance the reliability of data collection and transmission services,ensuring a comprehensive and practical approach.Our approach focuses on refining the hop-count-based routing scheme to achieve fairness in forwarding reliability.Additionally,it emphasizes reliable data collection within clusters and establishes robust data transmission over multiple hops.These systematic improvements are designed to optimize the overall performance of the WSN in real-world scenarios.Simulation results of the proposed protocol validate its exceptional performance compared to other prominent data transmission schemes.The evaluation spans varying sensor densities,wireless channel conditions,and packet transmission rates,showcasing the protocol’s superiority in ensuring reliable and efficient data transfer.Our systematic end-to-end design successfully addresses the challenges posed by the instability of wireless links in large-scaleWSNs.By prioritizing fairness,reliability,and efficiency,the proposed protocol demonstrates its efficacy in enhancing data collection and transmission services,thereby offering a valuable contribution to the field of medical event-drivenWSNs.展开更多
Social media data created a paradigm shift in assessing situational awareness during a natural disaster or emergencies such as wildfire, hurricane, tropical storm etc. Twitter as an emerging data source is an effectiv...Social media data created a paradigm shift in assessing situational awareness during a natural disaster or emergencies such as wildfire, hurricane, tropical storm etc. Twitter as an emerging data source is an effective and innovative digital platform to observe trend from social media users’ perspective who are direct or indirect witnesses of the calamitous event. This paper aims to collect and analyze twitter data related to the recent wildfire in California to perform a trend analysis by classifying firsthand and credible information from Twitter users. This work investigates tweets on the recent wildfire in California and classifies them based on witnesses into two types: 1) direct witnesses and 2) indirect witnesses. The collected and analyzed information can be useful for law enforcement agencies and humanitarian organizations for communication and verification of the situational awareness during wildfire hazards. Trend analysis is an aggregated approach that includes sentimental analysis and topic modeling performed through domain-expert manual annotation and machine learning. Trend analysis ultimately builds a fine-grained analysis to assess evacuation routes and provide valuable information to the firsthand emergency responders<span style="font-family:Verdana;">.</span>展开更多
Both computer science and archival science are concerned with archiving large-scale data,but they have different focuses.Large-scale data archiving in computer science focuses on technical aspects that can reduce the ...Both computer science and archival science are concerned with archiving large-scale data,but they have different focuses.Large-scale data archiving in computer science focuses on technical aspects that can reduce the cost of data storage and improve the reliability and efficiency of Big Data management.Its weaknesses lie in inadequate and non-standardized management.Archiving in archival science focuses on the management aspects and neglects the necessary technical considerations,resulting in high storage and retention costs and poor ability to manage Big Data.Therefore,the integration of large-scale data archiving and archival theory can balance the existing research limitations of the two fields and propose two research topics for related research-archival management of Big Data and large-scale management of archived Big Data.展开更多
Kernel is a kind of data summary which is elaborately extracted from a large dataset.Given a problem,the solution obtained from the kernel is an approximate version of the solution obtained from the whole dataset with...Kernel is a kind of data summary which is elaborately extracted from a large dataset.Given a problem,the solution obtained from the kernel is an approximate version of the solution obtained from the whole dataset with a provable approximate ratio.It is widely used in geometric optimization,clustering,and approximate query processing,etc.,for scaling them up to massive data.In this paper,we focus on the minimumε-kernel(MK)computation that asks for a kernel of the smallest size for large-scale data processing.For the open problem presented by Wang et al.that whether the minimumε-coreset(MC)problem and the MK problem can be reduced to each other,we first formalize the MK problem and analyze its complexity.Due to the NP-hardness of the MK problem in three or higher dimensions,an approximate algorithm,namely Set Cover-Based Minimumε-Kernel algorithm(SCMK),is developed to solve it.We prove that the MC problem and the MK problem can be Turing-reduced to each other.Then,we discuss the update of MK under insertion and deletion operations,respectively.Finally,a randomized algorithm,called the Randomized Algorithm of Set Cover-Based Minimumε-Kernel algorithm(RA-SCMK),is utilized to further reduce the complexity of SCMK.The efficiency and effectiveness of SCMK and RA-SCMK are verified by experimental results on real-world and synthetic datasets.Experiments show that the kernel sizes of SCMK are 2x and 17.6x smaller than those of an ANN-based method on real-world and synthetic datasets,respectively.The speedup ratio of SCMK over the ANN-based method is 5.67 on synthetic datasets.RA-SCMK runs up to three times faster than SCMK on synthetic datasets.展开更多
Traditional large-scale multi-objective optimization algorithms(LSMOEAs)encounter difficulties when dealing with sparse large-scale multi-objective optimization problems(SLM-OPs)where most decision variables are zero....Traditional large-scale multi-objective optimization algorithms(LSMOEAs)encounter difficulties when dealing with sparse large-scale multi-objective optimization problems(SLM-OPs)where most decision variables are zero.As a result,many algorithms use a two-layer encoding approach to optimize binary variable Mask and real variable Dec separately.Nevertheless,existing optimizers often focus on locating non-zero variable posi-tions to optimize the binary variables Mask.However,approxi-mating the sparse distribution of real Pareto optimal solutions does not necessarily mean that the objective function is optimized.In data mining,it is common to mine frequent itemsets appear-ing together in a dataset to reveal the correlation between data.Inspired by this,we propose a novel two-layer encoding learning swarm optimizer based on frequent itemsets(TELSO)to address these SLMOPs.TELSO mined the frequent terms of multiple particles with better target values to find mask combinations that can obtain better objective values for fast convergence.Experi-mental results on five real-world problems and eight benchmark sets demonstrate that TELSO outperforms existing state-of-the-art sparse large-scale multi-objective evolutionary algorithms(SLMOEAs)in terms of performance and convergence speed.展开更多
Assessment of past-climate simulations of regional climate models(RCMs)is important for understanding the reliability of RCMs when used to project future regional climate.Here,we assess the performance and discuss pos...Assessment of past-climate simulations of regional climate models(RCMs)is important for understanding the reliability of RCMs when used to project future regional climate.Here,we assess the performance and discuss possible causes of biases in a WRF-based RCM with a grid spacing of 50 km,named WRFG,from the North American Regional Climate Change Assessment Program(NARCCAP)in simulating wet season precipitation over the Central United States for a period when observational data are available.The RCM reproduces key features of the precipitation distribution characteristics during late spring to early summer,although it tends to underestimate the magnitude of precipitation.This dry bias is partially due to the model’s lack of skill in simulating nocturnal precipitation related to the lack of eastward propagating convective systems in the simulation.Inaccuracy in reproducing large-scale circulation and environmental conditions is another contributing factor.The too weak simulated pressure gradient between the Rocky Mountains and the Gulf of Mexico results in weaker southerly winds in between,leading to a reduction of warm moist air transport from the Gulf to the Central Great Plains.The simulated low-level horizontal convergence fields are less favorable for upward motion than in the NARR and hence,for the development of moist convection as well.Therefore,a careful examination of an RCM’s deficiencies and the identification of the source of errors are important when using the RCM to project precipitation changes in future climate scenarios.展开更多
Data Grid integrates graphically distributed resources for solving data intensive scientific applications. Effective scheduling in Grid can reduce the amount of data transferred among nodes by submitting a job to a no...Data Grid integrates graphically distributed resources for solving data intensive scientific applications. Effective scheduling in Grid can reduce the amount of data transferred among nodes by submitting a job to a node, where most of the requested data files are available. Scheduling is a traditional problem in parallel and distributed system. However, due to special issues and goals of Grid, traditional approach is not effective in this environment any more. Therefore, it is necessary to propose methods specialized for this kind of parallel and distributed system. Another solution is to use a data replication strategy to create multiple copies of files and store them in convenient locations to shorten file access times. To utilize the above two concepts, in this paper we develop a job scheduling policy, called hierarchical job scheduling strategy (HJSS), and a dynamic data replication strategy, called advanced dynamic hierarchical replication strategy (ADHRS), to improve the data access efficiencies in a hierarchical Data Grid. HJSS uses hierarchical scheduling to reduce the search time for an appropriate computing node. It considers network characteristics, number of jobs waiting in queue, file locations, and disk read speed of storage drive at data sources. Moreover, due to the limited storage capacity, a good replica replacement algorithm is needed. We present a novel replacement strategy which deletes files in two steps when free space is not enough for the new replica: first, it deletes those files with minimum time for transferring. Second, if space is still insufficient then it considers the last time the replica was requested, number of access, size of replica and file transfer time. The simulation results show that our proposed algorithm has better performance in comparison with other algorithms in terms of job execution time, number of intercommunications, number of replications, hit ratio, computing resource usage and storage usage.展开更多
Today, data is flowing into various organizations at an unprecedented scale. The ability to scale out for processing an enhanced workload has become an important factor for the proliferation and popularization of data...Today, data is flowing into various organizations at an unprecedented scale. The ability to scale out for processing an enhanced workload has become an important factor for the proliferation and popularization of database systems. Big data applications demand and consequently lead to the developments of diverse large-scale data management systems in different organizations, ranging from traditional database vendors to new emerging Internet-based enterprises. In this survey, we investigate, characterize, and analyze the large-scale data management systems in depth and develop comprehensive taxonomies for various critical aspects covering the data model, the system architecture, and the consistency model. We map the prevailing highly scalable data management systems to the proposed taxonomies, not only to classify the common techniques but also to provide a basis for analyzing current system scalability limitations. To overcome these limitations, we predicate and highlight the possible principles that future efforts need to be undertaken for the next generation large-scale data management systems.展开更多
Sparse large-scale multi-objective optimization problems(SLMOPs)are common in science and engineering.However,the large-scale problem represents the high dimensionality of the decision space,requiring algorithms to tr...Sparse large-scale multi-objective optimization problems(SLMOPs)are common in science and engineering.However,the large-scale problem represents the high dimensionality of the decision space,requiring algorithms to traverse vast expanse with limited computational resources.Furthermore,in the context of sparse,most variables in Pareto optimal solutions are zero,making it difficult for algorithms to identify non-zero variables efficiently.This paper is dedicated to addressing the challenges posed by SLMOPs.To start,we introduce innovative objective functions customized to mine maximum and minimum candidate sets.This substantial enhancement dramatically improves the efficacy of frequent pattern mining.In this way,selecting candidate sets is no longer based on the quantity of nonzero variables they contain but on a higher proportion of nonzero variables within specific dimensions.Additionally,we unveil a novel approach to association rule mining,which delves into the intricate relationships between non-zero variables.This novel methodology aids in identifying sparse distributions that can potentially expedite reductions in the objective function value.We extensively tested our algorithm across eight benchmark problems and four real-world SLMOPs.The results demonstrate that our approach achieves competitive solutions across various challenges.展开更多
How to effectively reduce the energy consumption of large-scale data centers is a key issue in cloud computing. This paper presents a novel low-power task scheduling algorithm (L3SA) for large-scale cloud data cente...How to effectively reduce the energy consumption of large-scale data centers is a key issue in cloud computing. This paper presents a novel low-power task scheduling algorithm (L3SA) for large-scale cloud data centers. The winner tree is introduced to make the data nodes as the leaf nodes of the tree and the final winner on the purpose of reducing energy consumption is selected. The complexity of large-scale cloud data centers is fully consider, and the task comparson coefficient is defined to make task scheduling strategy more reasonable. Experiments and performance analysis show that the proposed algorithm can effectively improve the node utilization, and reduce the overall power consumption of the cloud data center.展开更多
Bedding slope is a typical heterogeneous slope consisting of different soil/rock layers and is likely to slide along the weakest interface.Conventional slope protection methods for bedding slopes,such as retaining wal...Bedding slope is a typical heterogeneous slope consisting of different soil/rock layers and is likely to slide along the weakest interface.Conventional slope protection methods for bedding slopes,such as retaining walls,stabilizing piles,and anchors,are time-consuming and labor-and energy-intensive.This study proposes an innovative polymer grout method to improve the bearing capacity and reduce the displacement of bedding slopes.A series of large-scale model tests were carried out to verify the effectiveness of polymer grout in protecting bedding slopes.Specifically,load-displacement relationships and failure patterns were analyzed for different testing slopes with various dosages of polymer.Results show the great potential of polymer grout in improving bearing capacity,reducing settlement,and protecting slopes from being crushed under shearing.The polymer-treated slopes remained structurally intact,while the untreated slope exhibited considerable damage when subjected to loads surpassing the bearing capacity.It is also found that polymer-cemented soils concentrate around the injection pipe,forming a fan-shaped sheet-like structure.This study proves the improvement of polymer grouting for bedding slope treatment and will contribute to the development of a fast method to protect bedding slopes from landslides.展开更多
This article introduces the concept of load aggregation,which involves a comprehensive analysis of loads to acquire their external characteristics for the purpose of modeling and analyzing power systems.The online ide...This article introduces the concept of load aggregation,which involves a comprehensive analysis of loads to acquire their external characteristics for the purpose of modeling and analyzing power systems.The online identification method is a computer-involved approach for data collection,processing,and system identification,commonly used for adaptive control and prediction.This paper proposes a method for dynamically aggregating large-scale adjustable loads to support high proportions of new energy integration,aiming to study the aggregation characteristics of regional large-scale adjustable loads using online identification techniques and feature extraction methods.The experiment selected 300 central air conditioners as the research subject and analyzed their regulation characteristics,economic efficiency,and comfort.The experimental results show that as the adjustment time of the air conditioner increases from 5 minutes to 35 minutes,the stable adjustment quantity during the adjustment period decreases from 28.46 to 3.57,indicating that air conditioning loads can be controlled over a long period and have better adjustment effects in the short term.Overall,the experimental results of this paper demonstrate that analyzing the aggregation characteristics of regional large-scale adjustable loads using online identification techniques and feature extraction algorithms is effective.展开更多
Accurate positioning is one of the essential requirements for numerous applications of remote sensing data,especially in the event of a noisy or unreliable satellite signal.Toward this end,we present a novel framework...Accurate positioning is one of the essential requirements for numerous applications of remote sensing data,especially in the event of a noisy or unreliable satellite signal.Toward this end,we present a novel framework for aircraft geo-localization in a large range that only requires a downward-facing monocular camera,an altimeter,a compass,and an open-source Vector Map(VMAP).The algorithm combines the matching and particle filter methods.Shape vector and correlation between two building contour vectors are defined,and a coarse-to-fine building vector matching(CFBVM)method is proposed in the matching stage,for which the original matching results are described by the Gaussian mixture model(GMM).Subsequently,an improved resampling strategy is designed to reduce computing expenses with a huge number of initial particles,and a credibility indicator is designed to avoid location mistakes in the particle filter stage.An experimental evaluation of the approach based on flight data is provided.On a flight at a height of 0.2 km over a flight distance of 2 km,the aircraft is geo-localized in a reference map of 11,025 km~2using 0.09 km~2aerial images without any prior information.The absolute localization error is less than 10 m.展开更多
Background: The importance of structurally diverse forests for the conservation of biodiversity and provision of a wide range of ecosystem services has been widely recognised. However, tools to quantify structural div...Background: The importance of structurally diverse forests for the conservation of biodiversity and provision of a wide range of ecosystem services has been widely recognised. However, tools to quantify structural diversity of forests in an objective and quantitative way across many forest types and sites are still needed, for example to support biodiversity monitoring. The existing approaches to quantify forest structural diversity are based on small geographical regions or single forest types, typically using only small data sets.Results: Here we developed an index of structural diversity based on National Forest Inventory(NFI) data of BadenWurttemberg, Germany, a state with 1.3 million ha of diverse forest types in different ownerships. Based on a literature review, 11 aspects of structural diversity were identified a priori as crucially important to describe structural diversity. An initial comprehensive list of 52 variables derived from National Forest Inventory(NFI) data related to structural diversity was reduced by applying five selection criteria to arrive at one variable for each aspect of structural diversity. These variables comprise 1) quadratic mean diameter at breast height(DBH), 2) standard deviation of DBH, 3) standard deviation of stand height, 4) number of decay classes, 5) bark-diversity index, 6) trees with DBH ≥ 40 cm, 7) diversity of flowering and fructification, 8) average mean diameter of downed deadwood, 9) mean DBH of standing deadwood, 10) tree species richness and 11) tree species richness in the regeneration layer. These variables were combined into a simple,additive index to quantify the level of structural diversity, which assumes values between 0 and 1. We applied this index in an exemplary way to broad forest categories and ownerships to assess its feasibility to analyse structural diversity in large-scale forest inventories.Conclusions: The forest structure index presented here can be derived in a similar way from standard inventory variables for most other large-scale forest inventories to provide important information about biodiversity relevant forest conditions and thus provide an evidence-base for forest management and planning as well as reporting.展开更多
With the development of big data and social computing,large-scale group decisionmaking(LGDM)is nowmerging with social networks.Using social network analysis(SNA),this study proposes an LGDM consensus model that consid...With the development of big data and social computing,large-scale group decisionmaking(LGDM)is nowmerging with social networks.Using social network analysis(SNA),this study proposes an LGDM consensus model that considers the trust relationship among decisionmakers(DMs).In the process of consensusmeasurement:the social network is constructed according to the social relationship among DMs,and the Louvain method is introduced to classify social networks to form subgroups.In this study,the weights of each decision maker and each subgroup are computed by comprehensive network weights and trust weights.In the process of consensus improvement:A feedback mechanism with four identification and two direction rules is designed to guide the consensus of the improvement process.Based on the trust relationship among DMs,the preferences are modified,and the corresponding social network is updated to accelerate the consensus.Compared with the previous research,the proposedmodel not only allows the subgroups to be reconstructed and updated during the adjustment process,but also improves the accuracy of the adjustment by the feedbackmechanism.Finally,an example analysis is conducted to verify the effectiveness and flexibility of the proposed method.Moreover,compared with previous studies,the superiority of the proposed method in solving the LGDM problem is highlighted.展开更多
The large-scale multi-objective optimization algorithm(LSMOA),based on the grouping of decision variables,is an advanced method for handling high-dimensional decision variables.However,in practical problems,the intera...The large-scale multi-objective optimization algorithm(LSMOA),based on the grouping of decision variables,is an advanced method for handling high-dimensional decision variables.However,in practical problems,the interaction among decision variables is intricate,leading to large group sizes and suboptimal optimization effects;hence a large-scale multi-objective optimization algorithm based on weighted overlapping grouping of decision variables(MOEAWOD)is proposed in this paper.Initially,the decision variables are perturbed and categorized into convergence and diversity variables;subsequently,the convergence variables are subdivided into groups based on the interactions among different decision variables.If the size of a group surpasses the set threshold,that group undergoes a process of weighting and overlapping grouping.Specifically,the interaction strength is evaluated based on the interaction frequency and number of objectives among various decision variables.The decision variable with the highest interaction in the group is identified and disregarded,and the remaining variables are then reclassified into subgroups.Finally,the decision variable with the strongest interaction is added to each subgroup.MOEAWOD minimizes the interactivity between different groups and maximizes the interactivity of decision variables within groups,which contributed to the optimized direction of convergence and diversity exploration with different groups.MOEAWOD was subjected to testing on 18 benchmark large-scale optimization problems,and the experimental results demonstrate the effectiveness of our methods.Compared with the other algorithms,our method is still at an advantage.展开更多
Protein-protein interactions are of great significance for human to understand the functional mechanisms of proteins.With the rapid development of high-throughput genomic technologies,massive protein-protein interacti...Protein-protein interactions are of great significance for human to understand the functional mechanisms of proteins.With the rapid development of high-throughput genomic technologies,massive protein-protein interaction(PPI)data have been generated,making it very difficult to analyze them efficiently.To address this problem,this paper presents a distributed framework by reimplementing one of state-of-the-art algorithms,i.e.,CoFex,using MapReduce.To do so,an in-depth analysis of its limitations is conducted from the perspectives of efficiency and memory consumption when applying it for large-scale PPI data analysis and prediction.Respective solutions are then devised to overcome these limitations.In particular,we adopt a novel tree-based data structure to reduce the heavy memory consumption caused by the huge sequence information of proteins.After that,its procedure is modified by following the MapReduce framework to take the prediction task distributively.A series of extensive experiments have been conducted to evaluate the performance of our framework in terms of both efficiency and accuracy.Experimental results well demonstrate that the proposed framework can considerably improve its computational efficiency by more than two orders of magnitude while retaining the same high accuracy.展开更多
文摘This paper proposes one method of feature selection by using Bayes' theorem. The purpose of the proposed method is to reduce the computational complexity and increase the classification accuracy of the selected feature subsets. The dependence between two attributes (binary) is determined based on the probabilities of their joint values that contribute to positive and negative classification decisions. If opposing sets of attribute values do not lead to opposing classification decisions (zero probability), then the two attributes are considered independent of each other, otherwise dependent, and one of them can be removed and thus the number of attributes is reduced. The process must be repeated on all combinations of attributes. The paper also evaluates the approach by comparing it with existing feature selection algorithms over 8 datasets from University of California, Irvine (UCI) machine learning databases. The proposed method shows better results in terms of number of selected features, classification accuracy, and running time than most existing algorithms.
基金This research has been partially supported by the national natural science foundation of China (51175169) and the national science and technology support program (2012BAF02B01).
文摘In the face of a growing number of large-scale data sets, affinity propagation clustering algorithm to calculate the process required to build the similarity matrix, will bring huge storage and computation. Therefore, this paper proposes an improved affinity propagation clustering algorithm. First, add the subtraction clustering, using the density value of the data points to obtain the point of initial clusters. Then, calculate the similarity distance between the initial cluster points, and reference the idea of semi-supervised clustering, adding pairs restriction information, structure sparse similarity matrix. Finally, the cluster representative points conduct AP clustering until a suitable cluster division.Experimental results show that the algorithm allows the calculation is greatly reduced, the similarity matrix storage capacity is also reduced, and better than the original algorithm on the clustering effect and processing speed.
文摘Background A task assigned to space exploration satellites involves detecting the physical environment within a certain space.However,space detection data are complex and abstract.These data are not conducive for researchers'visual perceptions of the evolution and interaction of events in the space environment.Methods A time-series dynamic data sampling method for large-scale space was proposed for sample detection data in space and time,and the corresponding relationships between data location features and other attribute features were established.A tone-mapping method based on statistical histogram equalization was proposed and applied to the final attribute feature data.The visualization process is optimized for rendering by merging materials,reducing the number of patches,and performing other operations.Results The results of sampling,feature extraction,and uniform visualization of the detection data of complex types,long duration spans,and uneven spatial distributions were obtained.The real-time visualization of large-scale spatial structures using augmented reality devices,particularly low-performance devices,was also investigated.Conclusions The proposed visualization system can reconstruct the three-dimensional structure of a large-scale space,express the structure and changes in the spatial environment using augmented reality,and assist in intuitively discovering spatial environmental events and evolutionary rules.
文摘Large-scale wireless sensor networks(WSNs)play a critical role in monitoring dangerous scenarios and responding to medical emergencies.However,the inherent instability and error-prone nature of wireless links present significant challenges,necessitating efficient data collection and reliable transmission services.This paper addresses the limitations of existing data transmission and recovery protocols by proposing a systematic end-to-end design tailored for medical event-driven cluster-based large-scale WSNs.The primary goal is to enhance the reliability of data collection and transmission services,ensuring a comprehensive and practical approach.Our approach focuses on refining the hop-count-based routing scheme to achieve fairness in forwarding reliability.Additionally,it emphasizes reliable data collection within clusters and establishes robust data transmission over multiple hops.These systematic improvements are designed to optimize the overall performance of the WSN in real-world scenarios.Simulation results of the proposed protocol validate its exceptional performance compared to other prominent data transmission schemes.The evaluation spans varying sensor densities,wireless channel conditions,and packet transmission rates,showcasing the protocol’s superiority in ensuring reliable and efficient data transfer.Our systematic end-to-end design successfully addresses the challenges posed by the instability of wireless links in large-scaleWSNs.By prioritizing fairness,reliability,and efficiency,the proposed protocol demonstrates its efficacy in enhancing data collection and transmission services,thereby offering a valuable contribution to the field of medical event-drivenWSNs.
文摘Social media data created a paradigm shift in assessing situational awareness during a natural disaster or emergencies such as wildfire, hurricane, tropical storm etc. Twitter as an emerging data source is an effective and innovative digital platform to observe trend from social media users’ perspective who are direct or indirect witnesses of the calamitous event. This paper aims to collect and analyze twitter data related to the recent wildfire in California to perform a trend analysis by classifying firsthand and credible information from Twitter users. This work investigates tweets on the recent wildfire in California and classifies them based on witnesses into two types: 1) direct witnesses and 2) indirect witnesses. The collected and analyzed information can be useful for law enforcement agencies and humanitarian organizations for communication and verification of the situational awareness during wildfire hazards. Trend analysis is an aggregated approach that includes sentimental analysis and topic modeling performed through domain-expert manual annotation and machine learning. Trend analysis ultimately builds a fine-grained analysis to assess evacuation routes and provide valuable information to the firsthand emergency responders<span style="font-family:Verdana;">.</span>
基金supported by the National Natural Science Foundation of China(grant number 72074214).
文摘Both computer science and archival science are concerned with archiving large-scale data,but they have different focuses.Large-scale data archiving in computer science focuses on technical aspects that can reduce the cost of data storage and improve the reliability and efficiency of Big Data management.Its weaknesses lie in inadequate and non-standardized management.Archiving in archival science focuses on the management aspects and neglects the necessary technical considerations,resulting in high storage and retention costs and poor ability to manage Big Data.Therefore,the integration of large-scale data archiving and archival theory can balance the existing research limitations of the two fields and propose two research topics for related research-archival management of Big Data and large-scale management of archived Big Data.
基金the National Natural Science Foundation of China under Grant Nos.61732003,61832003,61972110 and U19A2059the National Key Research and Development Program of China under Grant No.2019YFB2101902the CCF-Baidu Open Fund CCF-BAIDU under Grant No.OF2021011.
文摘Kernel is a kind of data summary which is elaborately extracted from a large dataset.Given a problem,the solution obtained from the kernel is an approximate version of the solution obtained from the whole dataset with a provable approximate ratio.It is widely used in geometric optimization,clustering,and approximate query processing,etc.,for scaling them up to massive data.In this paper,we focus on the minimumε-kernel(MK)computation that asks for a kernel of the smallest size for large-scale data processing.For the open problem presented by Wang et al.that whether the minimumε-coreset(MC)problem and the MK problem can be reduced to each other,we first formalize the MK problem and analyze its complexity.Due to the NP-hardness of the MK problem in three or higher dimensions,an approximate algorithm,namely Set Cover-Based Minimumε-Kernel algorithm(SCMK),is developed to solve it.We prove that the MC problem and the MK problem can be Turing-reduced to each other.Then,we discuss the update of MK under insertion and deletion operations,respectively.Finally,a randomized algorithm,called the Randomized Algorithm of Set Cover-Based Minimumε-Kernel algorithm(RA-SCMK),is utilized to further reduce the complexity of SCMK.The efficiency and effectiveness of SCMK and RA-SCMK are verified by experimental results on real-world and synthetic datasets.Experiments show that the kernel sizes of SCMK are 2x and 17.6x smaller than those of an ANN-based method on real-world and synthetic datasets,respectively.The speedup ratio of SCMK over the ANN-based method is 5.67 on synthetic datasets.RA-SCMK runs up to three times faster than SCMK on synthetic datasets.
基金supported by the Scientific Research Project of Xiang Jiang Lab(22XJ02003)the University Fundamental Research Fund(23-ZZCX-JDZ-28)+5 种基金the National Science Fund for Outstanding Young Scholars(62122093)the National Natural Science Foundation of China(72071205)the Hunan Graduate Research Innovation Project(ZC23112101-10)the Hunan Natural Science Foundation Regional Joint Project(2023JJ50490)the Science and Technology Project for Young and Middle-aged Talents of Hunan(2023TJ-Z03)the Science and Technology Innovation Program of Humnan Province(2023RC1002)。
文摘Traditional large-scale multi-objective optimization algorithms(LSMOEAs)encounter difficulties when dealing with sparse large-scale multi-objective optimization problems(SLM-OPs)where most decision variables are zero.As a result,many algorithms use a two-layer encoding approach to optimize binary variable Mask and real variable Dec separately.Nevertheless,existing optimizers often focus on locating non-zero variable posi-tions to optimize the binary variables Mask.However,approxi-mating the sparse distribution of real Pareto optimal solutions does not necessarily mean that the objective function is optimized.In data mining,it is common to mine frequent itemsets appear-ing together in a dataset to reveal the correlation between data.Inspired by this,we propose a novel two-layer encoding learning swarm optimizer based on frequent itemsets(TELSO)to address these SLMOPs.TELSO mined the frequent terms of multiple particles with better target values to find mask combinations that can obtain better objective values for fast convergence.Experi-mental results on five real-world problems and eight benchmark sets demonstrate that TELSO outperforms existing state-of-the-art sparse large-scale multi-objective evolutionary algorithms(SLMOEAs)in terms of performance and convergence speed.
文摘Assessment of past-climate simulations of regional climate models(RCMs)is important for understanding the reliability of RCMs when used to project future regional climate.Here,we assess the performance and discuss possible causes of biases in a WRF-based RCM with a grid spacing of 50 km,named WRFG,from the North American Regional Climate Change Assessment Program(NARCCAP)in simulating wet season precipitation over the Central United States for a period when observational data are available.The RCM reproduces key features of the precipitation distribution characteristics during late spring to early summer,although it tends to underestimate the magnitude of precipitation.This dry bias is partially due to the model’s lack of skill in simulating nocturnal precipitation related to the lack of eastward propagating convective systems in the simulation.Inaccuracy in reproducing large-scale circulation and environmental conditions is another contributing factor.The too weak simulated pressure gradient between the Rocky Mountains and the Gulf of Mexico results in weaker southerly winds in between,leading to a reduction of warm moist air transport from the Gulf to the Central Great Plains.The simulated low-level horizontal convergence fields are less favorable for upward motion than in the NARR and hence,for the development of moist convection as well.Therefore,a careful examination of an RCM’s deficiencies and the identification of the source of errors are important when using the RCM to project precipitation changes in future climate scenarios.
文摘Data Grid integrates graphically distributed resources for solving data intensive scientific applications. Effective scheduling in Grid can reduce the amount of data transferred among nodes by submitting a job to a node, where most of the requested data files are available. Scheduling is a traditional problem in parallel and distributed system. However, due to special issues and goals of Grid, traditional approach is not effective in this environment any more. Therefore, it is necessary to propose methods specialized for this kind of parallel and distributed system. Another solution is to use a data replication strategy to create multiple copies of files and store them in convenient locations to shorten file access times. To utilize the above two concepts, in this paper we develop a job scheduling policy, called hierarchical job scheduling strategy (HJSS), and a dynamic data replication strategy, called advanced dynamic hierarchical replication strategy (ADHRS), to improve the data access efficiencies in a hierarchical Data Grid. HJSS uses hierarchical scheduling to reduce the search time for an appropriate computing node. It considers network characteristics, number of jobs waiting in queue, file locations, and disk read speed of storage drive at data sources. Moreover, due to the limited storage capacity, a good replica replacement algorithm is needed. We present a novel replacement strategy which deletes files in two steps when free space is not enough for the new replica: first, it deletes those files with minimum time for transferring. Second, if space is still insufficient then it considers the last time the replica was requested, number of access, size of replica and file transfer time. The simulation results show that our proposed algorithm has better performance in comparison with other algorithms in terms of job execution time, number of intercommunications, number of replications, hit ratio, computing resource usage and storage usage.
文摘Today, data is flowing into various organizations at an unprecedented scale. The ability to scale out for processing an enhanced workload has become an important factor for the proliferation and popularization of database systems. Big data applications demand and consequently lead to the developments of diverse large-scale data management systems in different organizations, ranging from traditional database vendors to new emerging Internet-based enterprises. In this survey, we investigate, characterize, and analyze the large-scale data management systems in depth and develop comprehensive taxonomies for various critical aspects covering the data model, the system architecture, and the consistency model. We map the prevailing highly scalable data management systems to the proposed taxonomies, not only to classify the common techniques but also to provide a basis for analyzing current system scalability limitations. To overcome these limitations, we predicate and highlight the possible principles that future efforts need to be undertaken for the next generation large-scale data management systems.
基金support by the Open Project of Xiangjiang Laboratory(22XJ02003)the University Fundamental Research Fund(23-ZZCX-JDZ-28,ZK21-07)+5 种基金the National Science Fund for Outstanding Young Scholars(62122093)the National Natural Science Foundation of China(72071205)the Hunan Graduate Research Innovation Project(CX20230074)the Hunan Natural Science Foundation Regional Joint Project(2023JJ50490)the Science and Technology Project for Young and Middle-aged Talents of Hunan(2023TJZ03)the Science and Technology Innovation Program of Humnan Province(2023RC1002).
文摘Sparse large-scale multi-objective optimization problems(SLMOPs)are common in science and engineering.However,the large-scale problem represents the high dimensionality of the decision space,requiring algorithms to traverse vast expanse with limited computational resources.Furthermore,in the context of sparse,most variables in Pareto optimal solutions are zero,making it difficult for algorithms to identify non-zero variables efficiently.This paper is dedicated to addressing the challenges posed by SLMOPs.To start,we introduce innovative objective functions customized to mine maximum and minimum candidate sets.This substantial enhancement dramatically improves the efficacy of frequent pattern mining.In this way,selecting candidate sets is no longer based on the quantity of nonzero variables they contain but on a higher proportion of nonzero variables within specific dimensions.Additionally,we unveil a novel approach to association rule mining,which delves into the intricate relationships between non-zero variables.This novel methodology aids in identifying sparse distributions that can potentially expedite reductions in the objective function value.We extensively tested our algorithm across eight benchmark problems and four real-world SLMOPs.The results demonstrate that our approach achieves competitive solutions across various challenges.
基金supported by the National Natural Science Foundation of China(6120200461272084)+9 种基金the National Key Basic Research Program of China(973 Program)(2011CB302903)the Specialized Research Fund for the Doctoral Program of Higher Education(2009322312000120113223110003)the China Postdoctoral Science Foundation Funded Project(2011M5000952012T50514)the Natural Science Foundation of Jiangsu Province(BK2011754BK2009426)the Jiangsu Postdoctoral Science Foundation Funded Project(1102103C)the Natural Science Fund of Higher Education of Jiangsu Province(12KJB520007)the Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions(yx002001)
文摘How to effectively reduce the energy consumption of large-scale data centers is a key issue in cloud computing. This paper presents a novel low-power task scheduling algorithm (L3SA) for large-scale cloud data centers. The winner tree is introduced to make the data nodes as the leaf nodes of the tree and the final winner on the purpose of reducing energy consumption is selected. The complexity of large-scale cloud data centers is fully consider, and the task comparson coefficient is defined to make task scheduling strategy more reasonable. Experiments and performance analysis show that the proposed algorithm can effectively improve the node utilization, and reduce the overall power consumption of the cloud data center.
基金supported by the Fujian Science Foundation for Outstanding Youth(Grant No.2023J06039)the National Natural Science Foundation of China(Grant No.41977259 and No.U2005205)Fujian Province natural resources science and technology innovation project(Grant No.KY-090000-04-2022-019)。
文摘Bedding slope is a typical heterogeneous slope consisting of different soil/rock layers and is likely to slide along the weakest interface.Conventional slope protection methods for bedding slopes,such as retaining walls,stabilizing piles,and anchors,are time-consuming and labor-and energy-intensive.This study proposes an innovative polymer grout method to improve the bearing capacity and reduce the displacement of bedding slopes.A series of large-scale model tests were carried out to verify the effectiveness of polymer grout in protecting bedding slopes.Specifically,load-displacement relationships and failure patterns were analyzed for different testing slopes with various dosages of polymer.Results show the great potential of polymer grout in improving bearing capacity,reducing settlement,and protecting slopes from being crushed under shearing.The polymer-treated slopes remained structurally intact,while the untreated slope exhibited considerable damage when subjected to loads surpassing the bearing capacity.It is also found that polymer-cemented soils concentrate around the injection pipe,forming a fan-shaped sheet-like structure.This study proves the improvement of polymer grouting for bedding slope treatment and will contribute to the development of a fast method to protect bedding slopes from landslides.
基金supported by the State Grid Science&Technology Project(5100-202114296A-0-0-00).
文摘This article introduces the concept of load aggregation,which involves a comprehensive analysis of loads to acquire their external characteristics for the purpose of modeling and analyzing power systems.The online identification method is a computer-involved approach for data collection,processing,and system identification,commonly used for adaptive control and prediction.This paper proposes a method for dynamically aggregating large-scale adjustable loads to support high proportions of new energy integration,aiming to study the aggregation characteristics of regional large-scale adjustable loads using online identification techniques and feature extraction methods.The experiment selected 300 central air conditioners as the research subject and analyzed their regulation characteristics,economic efficiency,and comfort.The experimental results show that as the adjustment time of the air conditioner increases from 5 minutes to 35 minutes,the stable adjustment quantity during the adjustment period decreases from 28.46 to 3.57,indicating that air conditioning loads can be controlled over a long period and have better adjustment effects in the short term.Overall,the experimental results of this paper demonstrate that analyzing the aggregation characteristics of regional large-scale adjustable loads using online identification techniques and feature extraction algorithms is effective.
文摘Accurate positioning is one of the essential requirements for numerous applications of remote sensing data,especially in the event of a noisy or unreliable satellite signal.Toward this end,we present a novel framework for aircraft geo-localization in a large range that only requires a downward-facing monocular camera,an altimeter,a compass,and an open-source Vector Map(VMAP).The algorithm combines the matching and particle filter methods.Shape vector and correlation between two building contour vectors are defined,and a coarse-to-fine building vector matching(CFBVM)method is proposed in the matching stage,for which the original matching results are described by the Gaussian mixture model(GMM).Subsequently,an improved resampling strategy is designed to reduce computing expenses with a huge number of initial particles,and a credibility indicator is designed to avoid location mistakes in the particle filter stage.An experimental evaluation of the approach based on flight data is provided.On a flight at a height of 0.2 km over a flight distance of 2 km,the aircraft is geo-localized in a reference map of 11,025 km~2using 0.09 km~2aerial images without any prior information.The absolute localization error is less than 10 m.
基金supported by a grant from the Ministry of Science,Research and the Arts of Baden-Württemberg(7533-10-5-78)to Jürgen BauhusFelix Storch received additional support through the BBW ForWerts Graduate Program
文摘Background: The importance of structurally diverse forests for the conservation of biodiversity and provision of a wide range of ecosystem services has been widely recognised. However, tools to quantify structural diversity of forests in an objective and quantitative way across many forest types and sites are still needed, for example to support biodiversity monitoring. The existing approaches to quantify forest structural diversity are based on small geographical regions or single forest types, typically using only small data sets.Results: Here we developed an index of structural diversity based on National Forest Inventory(NFI) data of BadenWurttemberg, Germany, a state with 1.3 million ha of diverse forest types in different ownerships. Based on a literature review, 11 aspects of structural diversity were identified a priori as crucially important to describe structural diversity. An initial comprehensive list of 52 variables derived from National Forest Inventory(NFI) data related to structural diversity was reduced by applying five selection criteria to arrive at one variable for each aspect of structural diversity. These variables comprise 1) quadratic mean diameter at breast height(DBH), 2) standard deviation of DBH, 3) standard deviation of stand height, 4) number of decay classes, 5) bark-diversity index, 6) trees with DBH ≥ 40 cm, 7) diversity of flowering and fructification, 8) average mean diameter of downed deadwood, 9) mean DBH of standing deadwood, 10) tree species richness and 11) tree species richness in the regeneration layer. These variables were combined into a simple,additive index to quantify the level of structural diversity, which assumes values between 0 and 1. We applied this index in an exemplary way to broad forest categories and ownerships to assess its feasibility to analyse structural diversity in large-scale forest inventories.Conclusions: The forest structure index presented here can be derived in a similar way from standard inventory variables for most other large-scale forest inventories to provide important information about biodiversity relevant forest conditions and thus provide an evidence-base for forest management and planning as well as reporting.
基金The work was supported by Humanities and Social Sciences Fund of the Ministry of Education(No.22YJA630119)the National Natural Science Foundation of China(No.71971051)Natural Science Foundation of Hebei Province(No.G2021501004).
文摘With the development of big data and social computing,large-scale group decisionmaking(LGDM)is nowmerging with social networks.Using social network analysis(SNA),this study proposes an LGDM consensus model that considers the trust relationship among decisionmakers(DMs).In the process of consensusmeasurement:the social network is constructed according to the social relationship among DMs,and the Louvain method is introduced to classify social networks to form subgroups.In this study,the weights of each decision maker and each subgroup are computed by comprehensive network weights and trust weights.In the process of consensus improvement:A feedback mechanism with four identification and two direction rules is designed to guide the consensus of the improvement process.Based on the trust relationship among DMs,the preferences are modified,and the corresponding social network is updated to accelerate the consensus.Compared with the previous research,the proposedmodel not only allows the subgroups to be reconstructed and updated during the adjustment process,but also improves the accuracy of the adjustment by the feedbackmechanism.Finally,an example analysis is conducted to verify the effectiveness and flexibility of the proposed method.Moreover,compared with previous studies,the superiority of the proposed method in solving the LGDM problem is highlighted.
基金supported in part by the Central Government Guides Local Science and TechnologyDevelopment Funds(Grant No.YDZJSX2021A038)in part by theNational Natural Science Foundation of China under(Grant No.61806138)in part by the China University Industry-University-Research Collaborative Innovation Fund(Future Network Innovation Research and Application Project)(Grant 2021FNA04014).
文摘The large-scale multi-objective optimization algorithm(LSMOA),based on the grouping of decision variables,is an advanced method for handling high-dimensional decision variables.However,in practical problems,the interaction among decision variables is intricate,leading to large group sizes and suboptimal optimization effects;hence a large-scale multi-objective optimization algorithm based on weighted overlapping grouping of decision variables(MOEAWOD)is proposed in this paper.Initially,the decision variables are perturbed and categorized into convergence and diversity variables;subsequently,the convergence variables are subdivided into groups based on the interactions among different decision variables.If the size of a group surpasses the set threshold,that group undergoes a process of weighting and overlapping grouping.Specifically,the interaction strength is evaluated based on the interaction frequency and number of objectives among various decision variables.The decision variable with the highest interaction in the group is identified and disregarded,and the remaining variables are then reclassified into subgroups.Finally,the decision variable with the strongest interaction is added to each subgroup.MOEAWOD minimizes the interactivity between different groups and maximizes the interactivity of decision variables within groups,which contributed to the optimized direction of convergence and diversity exploration with different groups.MOEAWOD was subjected to testing on 18 benchmark large-scale optimization problems,and the experimental results demonstrate the effectiveness of our methods.Compared with the other algorithms,our method is still at an advantage.
基金This work was supported in part by the National Natural Science Foundation of China(61772493)the CAAI-Huawei MindSpore Open Fund(CAAIXSJLJJ-2020-004B)+4 种基金the Natural Science Foundation of Chongqing(China)(cstc2019jcyjjqX0013)Chongqing Research Program of Technology Innovation and Application(cstc2019jscx-fxydX0024,cstc2019jscx-fxydX0027,cstc2018jszx-cyzdX0041)Guangdong Province Universities and College Pearl River Scholar Funded Scheme(2019)the Pioneer Hundred Talents Program of Chinese Academy of Sciencesthe Deanship of Scientific Research(DSR)at King Abdulaziz University(G-21-135-38).
文摘Protein-protein interactions are of great significance for human to understand the functional mechanisms of proteins.With the rapid development of high-throughput genomic technologies,massive protein-protein interaction(PPI)data have been generated,making it very difficult to analyze them efficiently.To address this problem,this paper presents a distributed framework by reimplementing one of state-of-the-art algorithms,i.e.,CoFex,using MapReduce.To do so,an in-depth analysis of its limitations is conducted from the perspectives of efficiency and memory consumption when applying it for large-scale PPI data analysis and prediction.Respective solutions are then devised to overcome these limitations.In particular,we adopt a novel tree-based data structure to reduce the heavy memory consumption caused by the huge sequence information of proteins.After that,its procedure is modified by following the MapReduce framework to take the prediction task distributively.A series of extensive experiments have been conducted to evaluate the performance of our framework in terms of both efficiency and accuracy.Experimental results well demonstrate that the proposed framework can considerably improve its computational efficiency by more than two orders of magnitude while retaining the same high accuracy.