One popular strategy to reduce the enormous number of illnesses and deaths from a seasonal influenza pandemic is to obtain the influenza vaccine on time. Usually, vaccine production must begin at least six months in advance, so accurate long-term influenza forecasting is essential. Although diverse machine learning models have been proposed for influenza forecasting, they focus on short-term forecasting, and their performance depends heavily on the input variables. For long-term influenza forecasting in a given country, typical surveillance data are known to be more effective than diverse external data from the Internet. We propose a two-stage data selection scheme over worldwide surveillance data to construct a long-term influenza forecasting model for a target country. In the first stage, starting from a simple forecasting model based on the target country's own surveillance data, we measure the change in performance when surveillance data from other countries, shifted by up to 52 weeks, are added. In the second stage, iterating over the surveillance data sets sorted by accuracy, we incrementally add a data set as input if it improves the performance of the first-stage forecasting model. Using the selected surveillance data, we train a new long-term influenza forecasting model and perform forecasting for the target country. We conducted extensive experiments with six machine learning models and three target countries to verify the effectiveness of the proposed method, and we report the main results.
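The two-stage scheme described above can be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation: the function names, the mean-absolute-error score, and the simple averaged-lag "model" used to test each addition are all stand-ins for the unspecified forecasting model.

```python
# Hypothetical sketch of the two-stage selection: rank candidate surveillance
# series by the error of a simple lagged predictor (stage 1), then greedily
# keep a series only if adding it improves the validation error (stage 2).

def lagged_error(target, candidate, lag):
    """Mean absolute error of predicting target[t] from candidate[t - lag]."""
    pairs = [(target[t], candidate[t - lag]) for t in range(lag, len(target))]
    return sum(abs(y - x) for y, x in pairs) / len(pairs)

def two_stage_select(target, candidates, max_lag=52):
    # Stage 1: find each candidate's best shift (up to max_lag weeks) and error.
    ranked = []
    for name, series in candidates.items():
        err, lag = min((lagged_error(target, series, lag), lag)
                       for lag in range(1, max_lag + 1))
        ranked.append((err, name, lag))
    ranked.sort()  # most predictive series first

    # Stage 2: incrementally keep a series only if it improves the ensemble.
    selected, best_err = [], float("inf")
    for err, name, lag in ranked:
        trial = selected + [(name, lag)]
        start = max(l for _, l in trial)
        # Toy ensemble: average the shifted candidate series as the forecast.
        preds = [sum(candidates[n][t - l] for n, l in trial) / len(trial)
                 for t in range(start, len(target))]
        trial_err = sum(abs(target[t] - p)
                        for t, p in zip(range(start, len(target)), preds)) / len(preds)
        if trial_err < best_err:
            selected, best_err = trial, trial_err
    return selected
```

In this toy form, a candidate series that is simply a shifted copy of the target is found at its correct lag and kept, while uninformative series are rejected.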
Geophysical data sets are growing at an ever-increasing rate, requiring computationally efficient data selection (thinning) methods to preserve essential information. Satellites such as WindSat provide large data sets for assessing the accuracy and computational efficiency of data selection techniques. A new data thinning technique based on support vector regression (SVR) is developed and tested. To manage large online satellite data streams, observations from WindSat are formed into subsets by Voronoi tessellation and each subset is then thinned by SVR (TSVR). Three experiments are performed. The first confirms the viability of TSVR for a relatively small sample, comparing it to several commonly used data thinning methods (random selection, averaging, and Barnes filtering) at a 10% thinning rate (90% data reduction); TSVR yields low mean absolute errors (MAE) and large correlations with the original data. A second experiment, using a larger data set, shows TSVR retrievals with MAE < 1 m s-1 and correlations ≥ 0.98; TSVR was also an order of magnitude faster than the commonly used thinning methods. A third experiment applies a two-stage pipeline to TSVR to accommodate online data. The pipeline subsets reconstruct the wind field with the same accuracy as in the second experiment and are an order of magnitude faster than the non-pipeline TSVR. Therefore, pipeline TSVR is two orders of magnitude faster than commonly used thinning methods that ingest the entire data set. This study demonstrates that pipeline TSVR thinning is an accurate and computationally efficient alternative to commonly used data selection techniques.
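The partition-then-thin idea can be illustrated with a much simpler stand-in for SVR: assign observations to Voronoi cells around seed points, then within each cell keep only the observations a trivial local model fails to explain, analogous to the support vectors outside SVR's epsilon-tube. The seed choice, the median as the local model, and the `eps` threshold are all illustrative simplifications, not the paper's method.

```python
import statistics

def voronoi_cells(points, seeds):
    """Assign each (x, y, value) observation to its nearest seed point (2-D)."""
    cells = {i: [] for i in range(len(seeds))}
    for p in points:
        i = min(range(len(seeds)),
                key=lambda j: (p[0] - seeds[j][0]) ** 2 + (p[1] - seeds[j][1]) ** 2)
        cells[i].append(p)
    return cells

def thin(points, seeds, eps):
    """Per cell, keep only points deviating from the cell median by more than eps,
    a crude analogue of keeping SVR support vectors outside the eps-tube."""
    kept = []
    for cell in voronoi_cells(points, seeds).values():
        if not cell:
            continue
        center = statistics.median(p[2] for p in cell)
        kept.extend(p for p in cell if abs(p[2] - center) > eps)
    return kept
```

Observations that agree with their cell's local summary are discarded, so smooth regions thin aggressively while anomalous observations survive.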
For the accurate extraction of cavity decay time, a data-point selection step is added to the weighted least squares method. We derive the expected precision, accuracy, and computation cost of this improved method, and examine these performances by simulation. Comparing this method with the nonlinear least squares fitting (NLSF) method and the linear regression of the sum (LRS) method, in both derivations and simulations, we find that it achieves the same or even better precision, comparable accuracy, and lower computation cost. We test the method on experimental decay signals; the results agree with those obtained from the nonlinear least squares fitting method.
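A minimal sketch of this kind of estimator, under assumptions of mine rather than the paper's exact formulation: for y(t) = A·exp(-t/τ), the log-signal ln y is linear in t, and weighting each point by y² counters the noise amplification of the log transform. The data-point selection step is shown as a simple amplitude cut; the 5% threshold is illustrative.

```python
import math

def decay_time(ts, ys, cut=0.05):
    """Extract tau from y(t) = A*exp(-t/tau) by weighted least squares on ln y.

    Points below cut*max(y) are discarded (the data-selection step); remaining
    points are weighted by y^2 in a standard weighted linear regression.
    """
    ymax = max(ys)
    sel = [(t, y) for t, y in zip(ts, ys) if y > cut * ymax]  # data selection
    w = [y * y for _, y in sel]                    # weights ~ y^2
    x = [t for t, _ in sel]
    z = [math.log(y) for _, y in sel]              # linearized: z = ln A - t/tau
    W = sum(w)
    xb = sum(wi * xi for wi, xi in zip(w, x)) / W  # weighted means
    zb = sum(wi * zi for wi, zi in zip(w, z)) / W
    slope = (sum(wi * (xi - xb) * (zi - zb) for wi, xi, zi in zip(w, x, z))
             / sum(wi * (xi - xb) ** 2 for wi, xi in zip(w, x)))
    return -1.0 / slope                            # slope = -1/tau
```

On a noise-free exponential the fit recovers τ exactly, and the whole computation is a single linear pass, which is where the cost advantage over iterative nonlinear fitting comes from.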
Interest in selecting an appropriate cloud data center is growing rapidly with the popularity and continuous growth of the cloud computing sector. Data center selection is made harder by the ever-increasing number of user requests and of data centers required to execute them. The cloud service broker policy that governs data center selection is an NP-hard problem requiring an efficient, high-quality solution. The differential evolution algorithm is a metaheuristic characterized by its speed and robustness, and it is well suited to selecting an appropriate cloud data center. This paper presents a cloud service broker policy based on a modified differential evolution algorithm for selecting the most appropriate data center in a cloud computing environment. The differential evolution algorithm is modified with a proposed new mutation technique that enhances performance and yields an appropriate selection of data centers. The superiority of the proposed policy is evaluated using the CloudAnalyst simulator, and the results are compared with state-of-the-art cloud service broker policies.
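For context, the classical DE/rand/1 mutation operator that such modifications start from can be sketched as below. The paper's own modified mutation is not specified in this abstract, so only the standard operator is shown; the function name and interface are mine.

```python
import random

def de_rand_1_mutation(population, i, f=0.5, rng=random):
    """Standard DE/rand/1 mutant for target index i: x_r1 + F * (x_r2 - x_r3),
    with r1, r2, r3 distinct indices different from i, and F the scale factor."""
    candidates = [j for j in range(len(population)) if j != i]
    r1, r2, r3 = rng.sample(candidates, 3)
    return [a + f * (b - c)
            for a, b, c in zip(population[r1], population[r2], population[r3])]
```

Modified-mutation DE variants typically change how r1, r2, r3 are chosen (e.g. biasing toward fitter individuals) or how F is adapted, while keeping this vector-difference structure.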
Principal component analysis (PCA) combined with artificial neural networks was used to classify the spectra of 27 steel samples acquired using laser-induced breakdown spectroscopy. Three methods of spectral data selection (all the peak lines of the spectra, intensive spectral partitions, and the whole spectra) were compared to assess the influence of different PCA inputs on the classification of steels. Three intensive partitions were selected based on experience and prior knowledge, as such partitions can give the best results compared with all peak lines and the whole spectra. We also used two test data sets, mean (averaged) spectra and raw spectra without any pretreatment, to verify the classification results. This comprehensive comparison shows that a back-propagation network trained on the principal components of appropriate, carefully selected spectral partitions obtains the best results; a perfect result with 100% classification accuracy was achieved using the intensive spectral partition ranging from 357 nm to 367 nm.
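The two preprocessing ideas here, slicing a spectrum down to one wavelength window and projecting onto principal components, can be sketched in a few lines. This is an illustrative simplification, not the paper's pipeline: the wavelength grid is invented, and only the first principal component is computed, via power iteration.

```python
def select_partition(wavelengths, intensities, lo, hi):
    """Keep only the spectral channels whose wavelength lies in [lo, hi] nm."""
    return [y for wl, y in zip(wavelengths, intensities) if lo <= wl <= hi]

def first_pc(data, iters=200):
    """First principal component of row-major data via power iteration,
    computing X^T (X v) / n each step so the covariance matrix is never formed."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    v = [1.0] * d
    for _ in range(iters):
        xv = [sum(r[j] * v[j] for j in range(d)) for r in centered]              # X v
        w = [sum(centered[i][j] * xv[i] for i in range(n)) / n for j in range(d)]  # X^T(Xv)/n
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return v
```

In the paper's setting, each spectrum would first be reduced to a partition such as 357-367 nm, then the partitioned spectra would be projected onto the leading components before feeding the back-propagation network.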
Numerical weather prediction (NWP) data possess internal inaccuracies, such as low NWP wind speed corresponding to high actual wind power generation. This study aims to reduce the negative effects of such inaccuracies by proposing a pure data-selection framework (PDF) that chooses useful data prior to modeling, thus improving the accuracy of day-ahead wind power forecasting. Briefly, we convert an entire NWP training dataset into many small subsets and then select the best subset combination via a validation set to build a forecasting model. Although small subsets increase selection flexibility, they can also produce billions of subset combinations, creating computational issues. To address this problem, we incorporate metamodeling and optimization steps into PDF, proposing a design-and-analysis-of-computer-experiments-based metamodeling algorithm and a heuristic-exhaustive search optimization algorithm, respectively. Experimental results demonstrate that (1) it is necessary to select data before constructing a forecasting model; (2) using smaller subsets will likely increase selection flexibility, leading to a more accurate forecasting model; (3) PDF generates a better training dataset than similarity-based data selection methods (e.g., K-means and support vector classification); and (4) choosing data before building a forecasting model produces a more accurate model than using a machine learning method to construct one directly.
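The combinatorial core of the problem, scoring every subset combination on a validation set, can be shown in brute-force form. This is the naive search that the paper's metamodeling and heuristic-exhaustive steps exist to avoid; the trivial mean-predictor standing in for the forecasting model and all names are illustrative.

```python
from itertools import combinations

def best_combination(subsets, validation):
    """Exhaustively score every non-empty combination of training subsets on a
    validation set and return the index tuple of the best one. A mean predictor
    stands in for the real forecasting model; real PDF avoids this full search."""
    target = sum(validation) / len(validation)
    best, best_err = None, float("inf")
    for r in range(1, len(subsets) + 1):
        for combo in combinations(range(len(subsets)), r):
            pooled = [x for i in combo for x in subsets[i]]
            err = abs(sum(pooled) / len(pooled) - target)
            if err < best_err:
                best, best_err = combo, err
    return best
```

With s subsets this search visits 2^s - 1 combinations, which is exactly why smaller subsets (larger s) explode the search space and motivate the metamodeling shortcut.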
This paper proposes a model to analyze massive electricity data. A feature subset is determined by correlation-based feature selection and data-driven methods. Using the selected feature subset, the attribute season can be classified successfully by five classifiers, and the best model can then be determined. With the chosen model, the effects of three other attributes (months, businesses, and meters) on electricity consumption can be estimated. The data used for the project were provided by the Beijing Power Supply Bureau, and WEKA was used as the machine learning tool. The models we built are promising for electricity scheduling and power theft detection.
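The simplest form of correlation-driven feature selection scores each feature by its absolute Pearson correlation with the target and keeps the top k. This is a hedged sketch with invented names; full correlation-based feature selection (CFS, as in WEKA) additionally penalizes redundancy between the selected features, which is omitted here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_features(columns, target, k):
    """columns: {name: list of values}. Return the k names whose columns are
    most strongly correlated (in absolute value) with the target."""
    scored = sorted(columns, key=lambda c: -abs(pearson(columns[c], target)))
    return scored[:k]
```

A feature perfectly anti-correlated with the target scores as highly as a perfectly correlated one, which is the intended behavior for classification inputs.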
Understanding the factors shaping species' distributions is a key longstanding topic in ecology with unresolved issues. The aims were to test whether the relative contribution of abiotic factors that set the geographical range of freshwater fish species varies spatially and/or depends on the geographical extent considered. The relative contribution of factors to discriminating between the conditions prevailing in the area where a species is present and those existing in the considered extent was estimated with the instability index included in the R package SPEDInstabR. We used three extent sizes: 1) each river basin where the species is present (local); 2) all river basins where the species is present (regional); and 3) the whole Earth (global). We used a data set of 16,543 freshwater fish species with a total of 845,764 geographical records, together with bioclimatic and topographic variables. Factors associated with temperature and altitude show the highest relative contribution to explaining the distribution of freshwater fishes at the smallest extent. Altitude and a mix of factors associated with temperature and precipitation were more important at the regional extent, and factors associated with precipitation show the highest contribution at the global extent. There was also spatial variability in the importance of factors, both between and within species and from region to region. Factors associated with precipitation show a clear latitudinal trend, decreasing in importance toward the equator.
Funding (long-term influenza forecasting): This research was supported by a government-wide R&D fund project for infectious disease research (GFID), Republic of Korea (Grant Number HG19C0682).
Funding (TSVR data thinning): NOAA Grant NA17RJ1227 and NSF Grant EIA-0205628 provided financial support for this work; also supported by RSF Grant 14-41-00039.
Funding (cavity decay time extraction): Supported by the Preeminent Youth Fund of Sichuan Province, China (Grant No. 2012JQ0012); the National Natural Science Foundation of China (Grant Nos. 11173008, 10974202, and 60978049); and the National Key Scientific and Research Equipment Development Project of China (Grant No. ZDYZ2013-2).
Funding (cloud data center selection): This work was supported by Universiti Sains Malaysia under external grant (Grant Number 304/PNAV/650958/U154).
Funding (LIBS steel classification): Supported by the National High Technology Research and Development Program of China (863 Program) (No. 2012AA040608); the National Natural Science Foundation of China (Nos. 61473279 and 61004131); and the Development of Scientific Research Equipment Program of the Chinese Academy of Sciences (No. YZ201247).
Funding (NWP data selection): Supported by the National Natural Science Foundation of China (72101066, 72131005, 72121001, 72171062, 91846301, and 71772053); the Heilongjiang Natural Science Excellent Youth Fund (YQ2022G004); and Key Research and Development Projects of Heilongjiang Province (JD22A003).
Funding (electricity data analysis): Supported by the National Earthquake Major Project of China (201008007) and the Fundamental Research Funds for Central Universities of China (216275645).