Statistics are more crucial than ever due to the accessibility of huge amounts of data from several domains such as finance, medicine, science, and engineering. Statistical data mining (SDM) is an interdisciplinary domain that examines large existing databases to discover patterns and connections in the data. It differs from classical statistics in the size of the datasets and in the fact that the data were often not originally gathered under any experimental design but for other purposes. Thus, this paper introduces an effective Statistical Data Mining for Intelligent Rainfall Prediction using Slime Mould Optimization with Deep Learning (SDMIRP-SMODL) model. In the presented SDMIRP-SMODL model, the feature subset selection process is performed by the SMO algorithm, which in turn minimizes the computational complexity. For rainfall prediction, a convolutional neural network with long short-term memory (CNN-LSTM) technique is exploited. Finally, this study employs the pelican optimization algorithm (POA) as a hyperparameter optimizer. The experimental evaluation of the SDMIRP-SMODL approach is carried out on a rainfall dataset comprising 23,682 samples in the negative class and 1,865 samples in the positive class. The comparative results demonstrate the superiority of the SDMIRP-SMODL model over existing techniques.
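The abstract names a CNN-LSTM classifier for rain/no-rain prediction; below is a minimal sketch of such a model in Keras. The layer sizes, window length, and feature count are illustrative assumptions, not the authors' configuration, and the toy data merely mimic the dataset's class imbalance.

```python
# Minimal CNN-LSTM rainfall classifier sketch (illustrative; not the authors' exact architecture).
import numpy as np
from tensorflow.keras import layers, models

n_timesteps, n_features = 30, 8  # assumed window of daily weather features

model = models.Sequential([
    layers.Input(shape=(n_timesteps, n_features)),
    layers.Conv1D(32, kernel_size=3, activation="relu"),  # local temporal patterns
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(64),                                      # longer-range dependencies
    layers.Dense(1, activation="sigmoid"),                # rain / no-rain probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy data standing in for the (imbalanced) rainfall dataset.
X = np.random.rand(256, n_timesteps, n_features).astype("float32")
y = (np.random.rand(256) < 0.07).astype("float32")  # ~7% positive, mimicking imbalance
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```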
There has been significant advancement in the application of statistical tools in plant pathology during the past four decades. These tools include multivariate analysis of disease dynamics involving principal component analysis, cluster analysis, factor analysis, pattern analysis, discriminant analysis, multivariate analysis of variance, correspondence analysis, canonical correlation analysis, redundancy analysis, genetic diversity analysis, and stability analysis, which involves joint regression, additive main effects and multiplicative interactions, and genotype-by-environment interaction biplot analysis. Advanced statistical tools, such as non-parametric analysis of disease association, meta-analysis, Bayesian analysis, and decision theory, occupy an important place in the analysis of disease dynamics. Disease forecasting by simulation models for plant diseases has great potential in practical disease control strategies. Common mathematical tools such as the monomolecular, exponential, logistic, Gompertz, and linked differential equations occupy an important place in growth curve analysis of disease epidemics. The highly informative means of displaying a range of numerical data through the construction of box-and-whisker plots has been suggested. The probable applications of recent advanced tools of linear and non-linear mixed models, such as the linear mixed model, generalized linear model, and generalized linear mixed model, are presented. The most recent technologies such as microarray analysis, though cost-effective, provide estimates of gene expression for thousands of genes simultaneously and need attention from molecular biologists. Some of these advanced tools can be well applied in different branches of rice research, including crop improvement, crop production, crop protection, social sciences, as well as agricultural engineering. Rice research scientists should take adequate advantage of these new opportunities by adopting these highly promising advanced technologies when planning experimental designs, data collection, analysis, and interpretation of their research data sets.
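The growth-curve models named here (logistic, Gompertz, etc.) are straightforward to fit in practice. Below is a hedged sketch fitting a logistic disease-progress curve with scipy; the severity data and starting parameters are invented for illustration.

```python
# Fitting a logistic disease-progress curve y(t) = K / (1 + exp(-r*(t - t0))),
# one of the growth models named in the abstract. Data are invented for illustration.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, K, r, t0):
    return K / (1.0 + np.exp(-r * (t - t0)))

days = np.array([0, 7, 14, 21, 28, 35, 42], dtype=float)
severity = np.array([0.01, 0.03, 0.10, 0.28, 0.55, 0.74, 0.82])  # disease proportion

(K, r, t0), _ = curve_fit(logistic, days, severity, p0=[1.0, 0.2, 21.0])
print(f"carrying capacity K={K:.2f}, rate r={r:.3f}, inflection t0={t0:.1f} days")
```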
Traffic tunnels include tunnel works for traffic and transport in the areas of railway, highway, and rail transit. With many mountains and nearly one fifth of the global population, China possesses numerous large cities and megalopolises with rapidly growing economies and huge traffic demands. As a result, a great deal of railway, highway, and rail transit facilities are required in this country. In the past, the construction of these facilities mainly involved subgrade and bridge works; in recent years…
Two statistical validation methods were used to evaluate the confidence level of the Total Column Ozone (TCO) measurements recorded by satellite systems measuring simultaneously, one using the normal distribution and another using the Mann-Whitney test. First, the reliability of the TCO measurements was studied hemispherically. While similar coincidences and levels of significance > 0.05 were found with the two statistical tests, an enormous variability in the levels of significance throughout the year was also exposed. Then, using the same statistical comparison methods, a latitudinal study was carried out in order to elucidate the geographical distribution that gave rise to this variability. Our study reveals that between the TOMS and OMI measurements in 2005 there was only a coincidence in 50% of the latitudes, which explained the variability. This implies that for 2005, the TOMS measurements are not completely reliable, except between the -50° and -15° latitude band in the southern hemisphere and between +15° and +50° latitude band in the northern hemisphere. In the case of OMI-OMPS, we observe that between 2011 and 2016 the measurements of both satellite systems are reasonably similar with a confidence level higher than 95%. However, in 2017 a band with a width of 20° latitude centered on the equator appeared, in which the significance levels were much less than 0.05, indicating that one of the measurement systems had begun to fail. In 2018, the fault was not only located in the equator, but was also replicated in various bands in the Southern Hemisphere. We interpret this as evidence of irreversible failure in one of the measurement systems.
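The comparison the abstract describes reduces, per latitude band, to a two-sample test between instruments. Here is a hedged sketch with scipy's Mann-Whitney U test; the arrays are synthetic stand-ins for paired instrument series.

```python
# Hedged sketch: comparing two sets of TCO measurements with the Mann-Whitney U test,
# as in the abstract's validation approach. The arrays are synthetic stand-ins.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
tco_a = rng.normal(300, 10, size=365)  # e.g., TOMS daily TCO in Dobson units
tco_b = rng.normal(301, 10, size=365)  # e.g., OMI daily TCO

stat, p = mannwhitneyu(tco_a, tco_b, alternative="two-sided")
# p > 0.05 -> no significant difference, i.e., the instruments agree at 95% confidence
print(f"U={stat:.0f}, p={p:.3f}, agree={p > 0.05}")
```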
Many fields, such as neuroscience, are experiencing the vast proliferation of cellular data, underscoring the need for organizing and interpreting large datasets. A popular approach partitions data into manageable subsets via hierarchical clustering, but objective methods to determine the appropriate classification granularity are missing. We recently introduced a technique to systematically identify when to stop subdividing clusters based on the fundamental principle that cells must differ more between than within clusters. Here we present the corresponding protocol to classify cellular datasets by combining data-driven unsupervised hierarchical clustering with statistical testing. These general-purpose functions are applicable to any cellular dataset that can be organized as two-dimensional matrices of numerical values, including molecular, physiological, and anatomical datasets. We demonstrate the protocol using cellular data from the Janelia MouseLight project to characterize morphological aspects of neurons.
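A minimal sketch of the protocol's core idea follows: cluster hierarchically, then accept a split only if between-cluster distances exceed within-cluster distances. The specific test and thresholds here are illustrative assumptions, not the published functions.

```python
# Sketch: hierarchically cluster cells, then only accept a split if cells differ more
# between than within clusters (tested here with a rank test on pairwise distances).
# Test choice and threshold are illustrative assumptions, not the authors' exact protocol.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
cells = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(3, 1, (30, 5))])  # 2 synthetic types

Z = linkage(cells, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")  # candidate split into 2 clusters

D = squareform(pdist(cells))
same = labels[:, None] == labels[None, :]
iu = np.triu_indices_from(D, k=1)
within, between = D[iu][same[iu]], D[iu][~same[iu]]

stat, p = mannwhitneyu(between, within, alternative="greater")
print(f"accept split: {p < 0.05} (p={p:.2e})")  # split kept only if between > within
```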
Pulsar detection has recently become an active research topic in radio astronomy. One of the essential procedures for pulsar detection is pulsar candidate sifting (PCS), a procedure for identifying potential pulsar signals in a survey. However, pulsar candidates are always class-imbalanced, as most candidates are non-pulsars such as RFI and only a tiny fraction come from real pulsars. Class imbalance can greatly affect the performance of machine learning (ML) models, resulting in heavy costs as some real pulsars are misjudged. To deal with this problem, the focus is on techniques for choosing relevant features to discriminate pulsars from non-pulsars, known as feature selection. Feature selection is the process of selecting a subset of the most relevant features from a feature pool. Distinguishing features between pulsars and non-pulsars can significantly improve the performance of the classifier even if the data are highly imbalanced. In this work, an algorithm for feature selection called the K-fold Relief-Greedy (KFRG) algorithm is designed. KFRG is a two-stage algorithm. In the first stage, it filters out irrelevant features according to their K-fold Relief scores, while in the second stage, it removes redundant features and selects the most relevant ones by a forward greedy search strategy. Experiments on the dataset of the High Time Resolution Universe survey verified that ML models based on KFRG are capable of PCS, correctly separating pulsars from non-pulsars even when the candidates are highly class-imbalanced.
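To make the two-stage idea concrete, here is a simplified stand-in in the spirit of KFRG (not the authors' exact algorithm): a basic Relief filter followed by a greedy forward pass that keeps a feature only if cross-validated accuracy improves. All data and hyperparameters are illustrative.

```python
# Hedged sketch of a two-stage selector in the spirit of KFRG: (1) filter features by a
# basic Relief score, (2) greedily add surviving features while CV accuracy improves.
# This is a simplified stand-in, not the authors' exact algorithm.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def relief_scores(X, y, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    scores = np.zeros(X.shape[1])
    for _ in range(n_iter):
        i = rng.integers(len(X))
        d = np.abs(X - X[i]).sum(axis=1); d[i] = np.inf
        hit = np.argmin(np.where(y == y[i], d, np.inf))    # nearest same-class sample
        miss = np.argmin(np.where(y != y[i], d, np.inf))   # nearest other-class sample
        scores += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return scores / n_iter

X, y = make_classification(n_samples=300, n_features=12, n_informative=4, random_state=0)
keep = np.argsort(relief_scores(X, y))[-6:]                # stage 1: Relief filter

selected, best = [], 0.0
for f in keep:                                             # stage 2: greedy forward search
    trial = selected + [int(f)]
    acc = cross_val_score(LogisticRegression(max_iter=1000), X[:, trial], y, cv=5).mean()
    if acc > best:
        selected, best = trial, acc
print(f"selected features: {selected}, CV accuracy: {best:.3f}")
```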
The statistical map is usually used to indicate the quantitative features of various socio-economic phenomena among regions on a base map of administrative divisions or on other base maps connected with statistical units. Making use of geographic information system (GIS) techniques, and supported by AutoCAD software, the author of this paper puts forward a practical method for making statistical maps and has developed software (SMT) for making small-scale statistical maps using the C language.
The development of adaptation measures to climate change relies on data from climate models or impact models. In order to analyze these large data sets or an ensemble of these data sets, the use of statistical methods is required. In this paper, the methodological approach to collecting, structuring and publishing the methods, which have been used or developed by former or present adaptation initiatives, is described. The intention is to communicate achieved knowledge and thus support future users. A key component is the participation of users in the development process. Main elements of the approach are standardized, template-based descriptions of the methods including the specific applications, references, and method assessment. All contributions have been quality checked, sorted, and placed in a larger context. The result is a report on statistical methods which is freely available as a printed or online version. Examples of how to use the methods are presented in this paper and are also included in the brochure.
A novel approach to detect and filter out an unhealthy dataset from a matrix of datasets is developed, tested, and proved. The technique employs a new type of self-organizing map called the Accumulative Statistical Spread Map (ASSM) to establish the destructive and negative effect a dataset would have on the rest of the matrix if it stayed within that matrix. The ASSM is supported by training a neural network engine, which determines which dataset is responsible for the network's inability to learn, classify and predict. The experiments carried out proved that a neural system was not able to learn in the presence of such an unhealthy dataset that possessed some deviated characteristics, even though it was produced under the same conditions and through the same process as the rest of the datasets in the matrix; hence, it should be disqualified, and either removed completely or transferred to another matrix. Such a novel approach is very useful in pattern recognition of datasets and features that do not belong to their source and could be used as an effective tool to detect suspicious activities in many areas of secure filing, communication and data storage.
Geomechanical data are never sufficient in quantity or adequately precise and accurate for design purposes in mining and civil engineering. The objective of this paper is to show the variability of rock properties at the sampled point in the roadway's roof, and then how the statistical processing of the available geomechanical data can affect the results of numerical modelling of the roadway's stability. Four cases were applied in the numerical analysis, using average values (the most common in geomechanical data analysis), average minus standard deviation, median, and average minus statistical error. The study shows that different approaches to the same geomechanical data set can change the modelling results considerably. The average-minus-standard-deviation case proves the most conservative and least risky: it gives a displacement and yielded-element zone four times broader than the average-values scenario, which is the least conservative option. The two other cases need further study; the results obtained from them lie between the most favorable and most adverse values. Taking the average values corrected by the statistical error for the numerical analysis seems to be the best solution. Moreover, the confidence level can be adjusted depending on the importance of the object and the assumed risk level.
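The four input summaries are simple to compute. A hedged sketch follows; the strength values are invented, and "statistical error" is assumed here to mean a t-based 95% confidence margin on the mean, which may differ from the paper's definition.

```python
# Sketch of the four ways the paper summarizes a geomechanical parameter before modelling;
# the UCS values below are invented, and the "statistical error" term is assumed to be a
# 95% t-based margin of error on the mean.
import numpy as np
from scipy import stats

ucs = np.array([38.0, 42.5, 35.1, 47.8, 40.2, 33.9, 44.6, 39.3])  # MPa, hypothetical roof samples

mean = ucs.mean()
mean_minus_sd = mean - ucs.std(ddof=1)          # most conservative input per the paper
median = np.median(ucs)
sem = stats.sem(ucs)                            # standard error of the mean
mean_minus_err = mean - sem * stats.t.ppf(0.975, df=len(ucs) - 1)  # assumed 95% correction

print(f"mean={mean:.1f}, mean-SD={mean_minus_sd:.1f}, "
      f"median={median:.1f}, mean-err={mean_minus_err:.1f}")
```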
Predicting the seeing of astronomical observations can provide hints of the quality of optical imaging in the near future and facilitate flexible scheduling of observation tasks to maximize the use of astronomical observatories. Traditional approaches to seeing prediction mostly rely on regional weather models to capture in-dome optical turbulence patterns. Thanks to the development of data gathering and aggregation facilities at astronomical observatories in recent years, data-driven approaches are becoming increasingly feasible and attractive for predicting astronomical seeing. This paper systematically investigates data-driven approaches to seeing prediction by leveraging various big data techniques, from traditional statistical modeling and machine learning to newly emerging deep learning methods, on the monitoring data of the Large sky Area Multi-Object fiber Spectroscopic Telescope (LAMOST). The raw monitoring data are preprocessed to allow for big data modeling. We then formulate the seeing prediction task under each type of modeling framework and develop seeing prediction models using representative big data techniques, including ARIMA and Prophet for statistical modeling, MLP and XGBoost for machine learning, and LSTM, GRU and Transformer for deep learning. We perform empirical studies on the developed models with a variety of feature configurations, yielding notable insights into the applicability of big data techniques to the seeing prediction task.
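As a concrete instance of the statistical-modeling baseline named above, here is a hedged ARIMA sketch on a synthetic seeing series; the model order and the diurnal toy data are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch: one of the statistical baselines named in the abstract (ARIMA) applied to
# a synthetic seeing series; order (2,1,1) and the data are illustrative assumptions.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
t = pd.date_range("2021-01-01", periods=500, freq="h")
seeing = pd.Series(1.2 + 0.3 * np.sin(np.arange(500) * 2 * np.pi / 24)
                   + rng.normal(0, 0.05, 500), index=t)  # arcsec, with a diurnal cycle

fit = ARIMA(seeing, order=(2, 1, 1)).fit()
forecast = fit.forecast(steps=6)  # seeing for the next six hours
print(forecast.round(2))
```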
Extracting and parameterizing ionospheric waves globally and statistically is a longstanding problem. Based on the multichannel maximum entropy method (MMEM) used for studying ionospheric waves in previous work, we calculate the parameters of ionospheric waves by applying the MMEM to numerous temporally approximate and spatially close global-positioning-system radio occultation total electron content profile triples provided by the unique clustered-satellite flight between 2006 and 2007, right after the launch of the Constellation Observing System for Meteorology, Ionosphere, and Climate (COSMIC) mission. The results show that the amplitude of ionospheric waves increases at low and high latitudes (~0.15 TECU) and decreases at mid-latitudes (~0.05 TECU). The vertical wavelength of the ionospheric waves increases at mid-latitudes (e.g., ~50 km at altitudes of 200–250 km) and decreases at low and high latitudes (e.g., ~35 km at altitudes of 200–250 km). The horizontal wavelength shows a similar pattern (e.g., ~1400 km at mid-latitudes and ~800 km at low and high latitudes).
We develop various statistical methods important for multidimensional genetic data analysis. Theorems justifying the application of these methods are established. We concentrate on multifactor dimensionality reduction, logic regression, random forests, and stochastic gradient boosting, along with their new modifications. We use complementary approaches to study the risk of complex diseases such as cardiovascular ones. The roles of certain combinations of single nucleotide polymorphisms and non-genetic risk factors are examined. To perform the data analysis concerning coronary heart disease and myocardial infarction, the Lomonosov Moscow State University supercomputer "Chebyshev" was employed.
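One of the named methods, random forests, is commonly used to rank SNPs and non-genetic risk factors by importance. Below is a hedged sketch on synthetic data; the feature names, genotype coding, and effect sizes are invented for illustration.

```python
# Sketch of one method from the toolbox (random forests) ranking SNP and non-genetic
# risk factors for a binary disease outcome; data and feature names are synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
n = 500
X = pd.DataFrame({
    "snp1": rng.integers(0, 3, n),     # SNP genotypes coded 0/1/2 (minor-allele count)
    "snp2": rng.integers(0, 3, n),
    "age": rng.normal(55, 10, n),
    "smoker": rng.integers(0, 2, n),
})
logit = 0.8 * X["snp1"] + 0.05 * (X["age"] - 55) + 0.6 * X["smoker"] - 1.5
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
print(pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False))
```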
We investigate the major characteristics of the occurrences and causes of, and countermeasures for, aircraft accidents in Japan. We apply statistical data analysis and mathematical modeling techniques to determine the relations among economic growth, aviation demand, the frequency of aircraft/helicopter accidents, the major characteristics of the occurrence intervals of accidents, and the number of fatalities due to accidents. The statistical model analysis suggests that the occurrence intervals of accidents and the numbers of fatalities can be explained by probability distributions such as the exponential distribution and the negative binomial distribution, respectively. We show that countermeasures for preventing accidents have been developed for every aircraft model, and thus have contributed to a significant decrease in the number of accidents over the last three decades. We find that the major cause of accidents involving large airplanes has been weather, while accidents involving small airplanes and helicopters are mainly due to pilot error. We also discover that, with respect to accidents mainly due to pilot error, there is a significant decrease in the number of accidents due to the aging of airplanes, whereas the number of accidents due to weather has barely declined. We further determine that accidents involving small and large airplanes mostly occur during takeoff and landing, whereas those involving helicopters are most likely to happen during flight. In order to decrease the number of accidents, i) enhancing safety and security by further developing technologies for aircraft, airports and air control radars, ii) establishing and improving training methods for crew, including pilots, mechanics and traffic controllers, iii) tightening public rules, and iv) strengthening efforts by individual aviation-related companies are absolutely necessary.
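The distributional claim about occurrence intervals is easy to check in code. Here is a hedged sketch fitting an exponential distribution to synthetic interval data and testing the fit; the intervals are stand-ins, not the Japanese accident record.

```python
# Sketch of the abstract's distributional claim: occurrence intervals of accidents fitted
# by an exponential distribution. Interval data are synthetic stand-ins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
intervals_days = rng.exponential(scale=45.0, size=200)  # days between successive accidents

loc, scale = stats.expon.fit(intervals_days, floc=0)    # MLE of the rate (1/scale)
ks = stats.kstest(intervals_days, "expon", args=(0, scale))
print(f"mean interval ~ {scale:.1f} days, KS p-value = {ks.pvalue:.2f}")  # large p -> good fit
```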
The most common way to analyze economic data is to use statistics software and spreadsheets. This paper presents the opportunities that a modern Geographical Information System (GIS) offers for the analysis of marketing, statistical, and macroeconomic data. It considers existing tools and models and their applications in various sectors. The advantage is that statistical data can be combined with geographic views, maps, and additional data derived from the GIS. As a result, a programming system is developed that uses GIS for the analysis of marketing, statistical, and macroeconomic data, as well as real-time risk assessment and prevention. The system has been successfully implemented as a web-based software application designed for use with a variety of hardware platforms (mobile devices, laptops, and desktop computers). The software is mainly written in the programming language Python, which offers better structure and support for the development of large applications. Optimization of the analysis and visualization of macroeconomic and statistical data by region for different business research is achieved. The system is designed with a Geographical Information System for settlements in their respective countries and regions. Integration with external software packages for statistical calculations and analysis is implemented in order to share data analysis, processing, and forecasting. Technologies and processes for loading data from different sources and tools for data analysis are developed. The successfully developed system allows the implementation of qualitative data analysis.
This paper analyzes the application value of statistical analysis methods for big data in economic management from both macro and micro perspectives, and analyzes their specific application in three areas: economic trends, industrial operations, and marketing strategies.
Results of research about the statistical reasoning that six high school teachers developed in a computer environment are presented in this article. A sequence of three activities supported by the software Fathom was presented to the teachers in a course to investigate the reasoning that teachers develop about data analysis, particularly about the concept of distribution, which involves important concepts such as averages, variability and graphical representations. The activities were designed so that the teachers first analyzed quantitative variables separately, and later analyzed a qualitative variable versus a quantitative variable, with the objective of establishing comparisons between distributions and using concepts such as averages, variability, shape and outliers. The instructions in each activity directed the teachers to use all the resources of the software necessary to make the complete analysis and to respond to certain questions intended to capture the types of representations they used to answer. The results indicate that despite the abundance of representations provided by the software, teachers focus on the calculation of averages to describe and compare distributions, rather than on important properties of the data such as variability, shape and outliers. Many teachers were able to build interesting graphs reflecting important properties of the data, but could not use them to support data analysis. Hence, it is necessary to extend teachers' understanding of data analysis so they can take advantage of the cognitive potential that computer tools offer.
According to statistics from the Printing and Printing Equipment Industries Association of China (PEIAC), the total output value of the printing industry of China in 2007 reached 440 billion RMB, while the total output value of printing equipment was…
In order to reduce the enormous pressure on environmental monitoring work caused by false sewage monitoring data, the Grubbs method, box plots, t-tests and other methods are used to perform in-depth analysis of the data, providing a convenient and simple technological process for identifying sewage monitoring data.
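The Grubbs method mentioned here flags a single outlier against a normality assumption. Below is a hedged sketch implemented from the textbook formula (scipy has no built-in Grubbs test); the monitoring values are invented for illustration.

```python
# Hedged sketch of a Grubbs test for a single outlier in sewage monitoring data,
# implemented from the textbook formula (scipy has no built-in Grubbs test); data invented.
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    g = np.max(np.abs(x - mean)) / sd                      # Grubbs statistic
    t2 = stats.t.ppf(1 - alpha / (2 * n), n - 2) ** 2      # critical value from t-quantile
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t2 / (n - 2 + t2))
    return g, g_crit, g > g_crit

cod = np.array([41.2, 39.8, 40.5, 42.1, 40.9, 39.5, 55.3, 41.0])  # mg/L, one suspect value
g, g_crit, is_outlier = grubbs_test(cod)
print(f"G={g:.2f}, critical={g_crit:.2f}, outlier detected: {is_outlier}")
```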