To reduce the enormous pressure that false sewage monitoring data place on environmental monitoring work, the Grubbs test, box plots, the t test, and other methods are used for in-depth analysis of the data, providing a convenient and simple workflow for screening sewage monitoring data.
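The screening steps named above can be sketched with standard-library tools. This is an illustrative sketch, not the paper's pipeline: the function names are mine, and the Grubbs critical value (which comes from the t distribution for a chosen significance level) is left to the caller rather than computed.

```python
from statistics import mean, stdev, quantiles

def iqr_outliers(data, k=1.5):
    """Box-plot rule: flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]

def grubbs_statistic(data):
    """Grubbs test statistic G = max|x - mean| / s; the caller compares it
    against the critical value from the t distribution for the chosen alpha."""
    m, s = mean(data), stdev(data)
    return max(abs(x - m) for x in data) / s
```

For a series of routine readings with one suspicious spike, `iqr_outliers` flags the spike and `grubbs_statistic` quantifies how extreme it is relative to the sample spread.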
Real industrial processes have multiple operating modes, and the collected data follow complex multimodal distributions, so most traditional process monitoring methods are no longer applicable because they presume that sampled data obey a single Gaussian or non-Gaussian distribution. To solve this problem, a novel weighted local standardization (WLS) strategy is proposed to standardize multimodal data; it eliminates the multimode characteristics of the collected data and normalizes them into a unimodal distribution. After detailed analysis of the proposed data preprocessing strategy, a new algorithm combining the WLS strategy with support vector data description (SVDD) is put forward for multimode process monitoring. Unlike strategies that build multiple local models, the developed method uses only one model and requires no prior knowledge of the multimode process. To demonstrate the proposed method's validity, it is applied to a numerical example and the Tennessee Eastman (TE) process. The simulation results show that the WLS strategy is very effective at standardizing multimodal data, and that the WLS-SVDD monitoring method has great advantages over traditional SVDD and over PCA combined with a local standardization strategy (LNS-PCA) in multimode process monitoring.
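The paper's WLS weighting scheme is not reproduced here. As a rough, assumed illustration of the local-standardization idea only, each sample can be standardized by the mean and spread of its k nearest neighbours, which maps well-separated operating modes onto a comparable, roughly unimodal scale:

```python
def local_standardize(data, k=2):
    """Standardize each point by the mean/std of its k nearest neighbours
    (plus itself), so that samples from different operating modes land on a
    comparable scale. Illustrative stand-in for WLS, not the paper's method."""
    out = []
    for x in data:
        neigh = sorted(data, key=lambda y: abs(y - x))[:k + 1]  # includes x
        m = sum(neigh) / len(neigh)
        var = sum((y - m) ** 2 for y in neigh) / len(neigh)
        s = var ** 0.5 or 1.0          # guard against zero spread
        out.append((x - m) / s)
    return out
```

On a two-mode series (values near 0 and values near 10), corresponding points in the two modes map to the same standardized value, erasing the mode separation.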
Atmospheric chemistry models usually perform badly in forecasting wintertime air pollution because of their uncertainties. Generally, such uncertainties can be decreased effectively by techniques such as data assimilation (DA) and model output statistics (MOS). However, the relative importance and combined effects of the two techniques have not been clarified. Here, a one-month air quality forecast with the Weather Research and Forecasting-Chemistry (WRF-Chem) model was carried out in a virtually operational setup focusing on Hebei Province, China. Meanwhile, three-dimensional variational (3DVar) DA and MOS based on one-dimensional Kalman filtering were implemented separately and simultaneously to investigate their performance in improving the model forecast. Comparison with observations shows that the chemistry forecast with MOS outperforms that with 3DVar DA, which could be seen in all the species tested over the whole 72 forecast hours. Combined use of both techniques does not guarantee a better forecast than MOS only, with the improvements and degradations being small and appearing rather randomly. Results indicate that the implementation of MOS is more suitable than 3DVar DA in improving the operational forecasting ability of WRF-Chem.
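A one-dimensional Kalman filter of the kind used for MOS-style bias correction can be sketched as follows. This is a simplified, assumed configuration, not the study's: the noise variances q and r are placeholders, and only a constant systematic bias is tracked.

```python
def kalman_bias_corrector(forecasts, observations, q=0.01, r=1.0):
    """One-dimensional Kalman filter tracking the systematic forecast bias b.
    Each (forecast, observation) pair updates the bias estimate, which is
    subtracted from the next forecast before it is issued.
    q: assumed process-noise variance; r: observation-noise variance."""
    b, p = 0.0, 1.0               # bias estimate and its error variance
    corrected = []
    for f, o in zip(forecasts, observations):
        corrected.append(f - b)   # correct with the current bias estimate
        p += q                    # predict step: bias assumed ~constant
        k = p / (p + r)           # Kalman gain
        b += k * ((f - o) - b)    # update with the observed error f - o
        p *= (1 - k)
    return corrected
```

With a persistent +2 forecast bias, the corrected series converges to the observations within a couple of dozen steps.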
Cryo-electron microscopy (cryo-EM) provides a powerful tool for resolving the structure of biological macromolecules in their natural state. One advantage of cryo-EM is that different conformational states of a protein complex can be built simultaneously, and the distribution of those states can be measured. This pushes cryo-EM beyond structure determination toward obtaining the thermodynamic properties of protein machines. Here, we used a deep manifold learning framework to obtain the conformational landscape of KaiC proteins, and further derived the thermodynamic properties of this central oscillator component of the circadian clock by means of statistical physics.
In this paper, an interactive method is proposed to describe computer animation data and accelerate the process of animation generation. First, a semantic model and the Resource Description Framework (RDF) are utilized to analyze and describe the relationships between animation data. Second, a novel context model that maintains context-awareness is proposed to facilitate data organization and storage; in this context model, all the main animation elements in a scene are operated on as a whole. Sketching is then used as the main interaction technique to describe the relationships between animation data, edit the context model, and perform other user operations. Finally, a sketch-based, context-aware computer animation data description system is built, and it also works well in the animation generation process.
Complex industrial processes often need multiple operating modes to meet changing production conditions. Within a single mode, some samples are sparse and scattered, and it is important to take these into account. To address this issue, a new approach called density-based support vector data description (DBSVDD) is proposed. In this article, an algorithm combining a Gaussian mixture model (GMM) with the DBSVDD technique is proposed for process monitoring. The GMM is used to obtain the center of each mode and to determine the number of modes. Considering the complexity of the data distribution and the discrete samples in the monitored process, DBSVDD is then used for process monitoring. Finally, the validity and effectiveness of the DBSVDD method are illustrated on the Tennessee Eastman (TE) process.
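As a hedged illustration of the GMM step, a minimal one-dimensional expectation-maximization routine can recover each mode's centre. The paper's method additionally selects the number of modes and works in higher dimensions, which is not shown; the initialization scheme below (means spread over the data range) is my simplification.

```python
import math

def gmm_em_1d(data, k=2, iters=50):
    """Minimal 1-D Gaussian-mixture EM. Means start spread over the data
    range; each iteration alternates responsibilities (E step) and weighted
    parameter re-estimation (M step). Returns (weights, means, stds)."""
    lo, hi = min(data), max(data)
    means = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    gmean = sum(data) / len(data)
    gstd = max(math.sqrt(sum((x - gmean) ** 2 for x in data) / len(data)), 1e-3)
    stds = [gstd] * k
    weights = [1.0 / k] * k

    def pdf(x, m, s):
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

    for _ in range(iters):
        # E step: responsibility of component j for each point x
        resp = []
        for x in data:
            ps = [w * pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
            tot = sum(ps) or 1e-300
            resp.append([p / tot for p in ps])
        # M step: re-estimate each component from its responsibilities
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            stds[j] = max(math.sqrt(var), 1e-3)
    return weights, means, stds
```

On data drawn from two well-separated operating modes, the fitted means land on the two mode centres.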
In atmospheric data assimilation systems, the forecast error covariance model is an important component. However, the parameters required by a forecast error covariance model are difficult to obtain because the truth is unknown. This study applies an error statistics estimation method to the Physical-space Statistical Analysis System (PSAS) height-wind forecast error covariance model. The method has two components: the first computes the error statistics using the National Meteorological Center (NMC) method, a lagged-forecast difference approach, within the framework of the PSAS height-wind forecast error covariance model; the second obtains a calibration formula to rescale the error standard deviations provided by the NMC method. The calibration is against the error statistics estimated by maximum-likelihood estimation (MLE) from rawinsonde height observed-minus-forecast residuals. A complete set of formulas for estimating the error statistics and for the calibration is applied to a one-month dataset generated by a general circulation model of NASA's Global Modeling and Assimilation Office (GMAO). There is a clear constant relationship between the NMC-method and MLE error statistics estimates. The final product provides a full set of 6-hour error statistics required by the PSAS height-wind forecast error covariance model over the globe. The features of these error statistics are examined and discussed.
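In sketch form, the NMC step reduces to taking the standard deviation of lagged-forecast differences valid at the same time, then rescaling with a calibration factor. The helper names are illustrative, and in the study the factor would come from the MLE fit rather than being supplied directly.

```python
from statistics import stdev

def nmc_error_std(f48, f24):
    """NMC (lagged-forecast difference) estimate of the forecast-error
    standard deviation: f48[i] and f24[i] are 48-h and 24-h forecasts
    valid at the same time; their spread proxies the error statistics."""
    return stdev(a - b for a, b in zip(f48, f24))

def calibrate(nmc_std, scale):
    """Rescale the NMC estimate; in the study the calibration factor comes
    from an MLE fit to observed-minus-forecast residuals."""
    return scale * nmc_std
```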
It is well known that nonparametric estimation of the regression function is highly sensitive to even a small proportion of outliers in the data. To handle atypical observations when the covariates of the nonparametric component are functional, robust estimates for the regression parameter and regression operator are introduced. The main purpose of the paper is to consider data-driven methods for selecting the number of neighbors, so that the proposed procedures are fully automatic. We use the k-Nearest Neighbors (kNN) procedure to construct the kernel estimator of the proposed robust model. Under some regularity conditions, we state consistency results for the kNN functional estimators that are uniform in the number of neighbors (UINN). Furthermore, a simulation study and an empirical application to real data on octane gasoline predictions are carried out to illustrate the higher predictive performance and usefulness of the kNN approach.
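A robust kNN regression estimate in this spirit can be sketched by aggregating the k nearest responses with a median rather than a mean, so a few outlying response values cannot dominate. This is an assumed simplification, not the authors' estimator; in the functional-covariate case, `dist` would be a metric between curves.

```python
def knn_robust_predict(x_train, y_train, x0, k, dist=None):
    """Predict y at x0 from the k nearest training covariates, using the
    median of their responses instead of the mean for robustness to
    outlying y-values. `dist` defaults to the scalar absolute distance."""
    dist = dist or (lambda a, b: abs(a - b))
    order = sorted(range(len(x_train)), key=lambda i: dist(x_train[i], x0))
    neigh = sorted(y_train[i] for i in order[:k])
    mid = len(neigh) // 2
    return neigh[mid] if len(neigh) % 2 else 0.5 * (neigh[mid - 1] + neigh[mid])
```

Even when one neighbour carries a wildly outlying response, the median of the neighbourhood keeps the prediction on the underlying trend.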
Statistics Norway has been engaged in the development of official statistics on accidents at work for the last ten years and represents Norway in international bodies such as Eurostat working groups. Some of this work was documented and presented in 2011 at the ISI Dublin convention, and a review of developments over the subsequent four years sheds further light on the efforts made. A new data collection system has been implemented at the national level, in which data and files based on forms for reporting accidents at work are sent from the Norwegian Labour and Welfare Administration (NLW) to Statistics Norway. Nevertheless, some challenges remain: the use of different versions of the NLW forms, the scanning and extraction of data at the NLW, the implementation of a secure electronic solution for transmitting data between the NLW and Statistics Norway, the reading and interpretation of TIFF files, and the lessons to be learned from other countries. The ambition is that Statistics Norway will produce methodologically sound official statistics on accidents at work within the first half of 2015 and will transmit to Eurostat the data and files necessary and sufficient to fulfil EU regulations within the first half of 2016.
Air quality monitoring is effective for timely understanding of the current air quality status of a region or city. Currently, the huge volume of environmental monitoring data, which has reasonable real-time performance, provides strong support for in-depth analysis of air pollution characteristics and causes. However, in the era of big data, to meet current demands for fine management of the atmospheric environment, it is important to explore the characteristics and causes of air pollution from multiple aspects for comprehensive and scientific evaluation of air quality. This study reviewed and summarized air quality evaluation methods based on environmental monitoring data statistics during the 13th Five-Year Plan period, and evaluated the level of air pollution in the Beijing-Tianjin-Hebei region and its surrounding areas (the "2+26" region) during the period of the three-year action plan to fight air pollution. We suggest that air quality should be comprehensively, deeply, and scientifically evaluated with respect to air pollution characteristics, causes, and the influences of meteorological conditions and anthropogenic emissions. It is also suggested that a three-year moving average be introduced as one of the evaluation indexes of long-term change in pollutants. Additionally, both temporal and spatial differences should be considered when removing confounding meteorological factors.
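The suggested three-year moving average index is straightforward to compute; a minimal sketch (the function name is mine):

```python
def three_year_moving_average(annual_values):
    """Three-year moving average suggested as a long-term evaluation index:
    entry i averages years i-2..i, damping single-year swings caused, for
    example, by unusually favourable or unfavourable meteorology."""
    return [sum(annual_values[i - 2:i + 1]) / 3
            for i in range(2, len(annual_values))]
```

For a steadily declining annual pollutant series such as [60, 57, 54, 51, 48], the index yields [57, 54, 51], preserving the trend while smoothing year-to-year noise.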
Many fields, such as neuroscience, are experiencing a vast proliferation of cellular data, underscoring the need for organizing and interpreting large datasets. A popular approach partitions data into manageable subsets via hierarchical clustering, but objective methods to determine the appropriate classification granularity are missing. We recently introduced a technique to systematically identify when to stop subdividing clusters, based on the fundamental principle that cells must differ more between clusters than within them. Here we present the corresponding protocol to classify cellular datasets by combining data-driven unsupervised hierarchical clustering with statistical testing. These general-purpose functions are applicable to any cellular dataset that can be organized as a two-dimensional matrix of numerical values, including molecular, physiological, and anatomical datasets. We demonstrate the protocol using cellular data from the Janelia MouseLight project to characterize morphological aspects of neurons.
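The stopping principle, cells must differ more between than within clusters, can be illustrated with a generic one-sided permutation test on pairwise distances. This is a stand-in for the protocol's actual statistical test, and the function name and defaults are assumptions.

```python
import random

def differ_more_between(within, between, n_perm=2000, alpha=0.05, seed=1):
    """Decide whether between-cluster distances exceed within-cluster
    distances, via a one-sided permutation test on the difference of means.
    Returns True (keep the split) when the observed difference is unlikely
    under random relabelling of the pooled distances."""
    observed = sum(between) / len(between) - sum(within) / len(within)
    pooled = list(within) + list(between)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        w, b = pooled[:len(within)], pooled[len(within):]
        if sum(b) / len(b) - sum(w) / len(w) >= observed:
            hits += 1
    return hits / n_perm < alpha   # True -> subclusters genuinely differ
```

Clearly separated distance distributions justify the split; heavily overlapping ones do not, which is the cue to stop subdividing.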
Pulsar detection has recently become an active research topic in radio astronomy. One of the essential procedures for pulsar detection is pulsar candidate sifting (PCS), which identifies potential pulsar signals in a survey. However, pulsar candidates are always class-imbalanced: most candidates are non-pulsars such as RFI, and only a tiny fraction come from real pulsars. Class imbalance can greatly affect the performance of machine learning (ML) models, incurring a heavy cost when real pulsars are misjudged. To deal with this problem, we focus on feature selection: choosing relevant features that discriminate pulsars from non-pulsars. Feature selection is the process of selecting a subset of the most relevant features from a feature pool; features that distinguish pulsars from non-pulsars can significantly improve classifier performance even when the data are highly imbalanced. In this work, a feature selection algorithm called K-fold Relief-Greedy (KFRG) is designed. KFRG is a two-stage algorithm: in the first stage, it filters out irrelevant features according to their K-fold Relief scores; in the second stage, it removes redundant features and selects the most relevant ones by a forward greedy search strategy. Experiments on the dataset of the High Time Resolution Universe survey verified that ML models based on KFRG are capable of PCS, correctly separating pulsars from non-pulsars even when the candidates are highly class-imbalanced.
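The Relief stage can be sketched as follows for numeric features. This simplified version scores every sample against its nearest hit and nearest miss; it omits KFRG's K-fold averaging and its greedy second stage.

```python
def relief_scores(X, y):
    """Simplified Relief: a feature scores high when samples lie close to
    their nearest same-class neighbour (hit) and far from their nearest
    different-class neighbour (miss) along that feature."""
    n, d = len(X), len(X[0])
    scores = [0.0] * d
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        sqdist = lambda j: sum((X[i][f] - X[j][f]) ** 2 for f in range(d))
        h = min(hits, key=sqdist)     # nearest hit
        m = min(misses, key=sqdist)   # nearest miss
        for f in range(d):
            scores[f] += abs(X[i][f] - X[m][f]) - abs(X[i][f] - X[h][f])
    return [s / n for s in scores]
```

On toy data where feature 0 separates the classes and feature 1 is noise, feature 0 receives the clearly higher score, which is exactly the signal the first KFRG stage filters on.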
Data breaches have massive consequences for companies, affecting them financially and undermining their reputation, which poses significant challenges to online security and the long-term viability of businesses. This study analyzes trends in data breaches in the United States, examining the frequency, causes, and magnitude of breaches across various industries. We document that data breaches are increasing, with hacking emerging as the leading cause. Our descriptive analyses explore factors influencing breaches, including security vulnerabilities, human error, and malicious attacks. The findings provide policymakers and businesses with actionable insights to bolster data security through proactive audits, patching, encryption, and response planning. By better understanding breach patterns and risk factors, organizations can take targeted steps to enhance protections and mitigate the potential damage of future incidents.
The founding conference of the Big Data Statistics Branch (BDSB) of the Chinese Association for Applied Statistics (CAAS) was held on 8 December 2018 at East China Normal University (ECNU), Shanghai, China. More than 600 experts and scholars attended the conference. Professor Zhang Riquan was elected as the chairman of the first Board of Directors of the BDSB. Fang Xiangzhong, Chairman of the CAAS, delivered a speech. Professor Wang Zhaojun and Dr Liu Zhong delivered, respectively, keynote reports on the development of Big Data research and practice at the conference. The BDSB will be dedicated to building a high-level big data statistics exchange platform for experts and scholars in universities, governments, enterprises, and other fields, to better serve society and the country's major strategies.
Air quality is essential to human life, yet consistent advances in practically every aspect of contemporary living have harmed it. Everyday industrial, transportation, and household activities release dangerous contaminants into our surroundings. This study investigated two years' worth of air quality and outlier detection data from two Indian cities. Studies on air pollution have used many methodologies, with the observed gases treated as a vector whose components are the gas concentration values for each observation. In our technique, we use curves to represent the monthly average of daily gas emissions. The approach, which is based on functional depth, was used to find outliers in the gas emissions of Delhi and Kolkata, and the outcomes were compared to those of the traditional method. In the evaluation and comparison of these models' performances, the functional approach performed well.
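Functional depth can be illustrated with a crude stand-in: rank each monthly curve by its average distance from the pointwise median curve, and treat the lowest-depth curves as outlier candidates. The depth definition below is an assumption for illustration, not the study's depth notion.

```python
def functional_depth(curves):
    """Crude functional depth: depth decreases with a curve's average
    distance from the pointwise median curve; the lowest-depth curves are
    outlier candidates. All curves must share the same time grid."""
    t = len(curves[0])
    med = []                                   # pointwise median curve
    for j in range(t):
        vals = sorted(c[j] for c in curves)
        k = len(vals) // 2
        med.append(vals[k] if len(vals) % 2 else 0.5 * (vals[k - 1] + vals[k]))
    return [1.0 / (1.0 + sum(abs(c[j] - med[j]) for j in range(t)) / t)
            for c in curves]
```

Among a bundle of similar emission curves plus one far-off curve, the far-off curve receives the minimum depth and is flagged first.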
Feature representation is one of the key issues in data clustering. The existing feature representation of scientific data is not sufficient, which to some extent affects the results of scientific data clustering. Therefore, this paper proposes the concept of a composite text description (CTD) and a CTD-based feature representation method for biomedical scientific data. The method mainly uses different feature weighting algorithms to represent candidate features based on two types of data sources, then combines and strengthens the two feature sets. Experiments show that the proposed feature representation method is more effective than traditional methods and can significantly improve the performance of biomedical data clustering.
Statistics are more crucial than ever due to the accessibility of huge amounts of data from domains such as finance, medicine, science, and engineering. Statistical data mining (SDM) is an interdisciplinary domain that examines huge existing databases to discover patterns and connections in the data. It differs from classical statistics in the size of the datasets and in the fact that the data were often not originally gathered under any experimental design but for other purposes. This paper introduces an effective statistical data mining model for intelligent rainfall prediction using slime mould optimization with deep learning (SDMIRP-SMODL). In the presented SDMIRP-SMODL model, feature subset selection is performed by the SMO algorithm, which in turn minimizes the computational complexity. For rainfall prediction, a convolutional neural network with long short-term memory (CNN-LSTM) is exploited. Finally, the study employs the pelican optimization algorithm (POA) as a hyperparameter optimizer. The experimental evaluation of the SDMIRP-SMODL approach uses a rainfall dataset comprising 23,682 samples in the negative class and 1,865 samples in the positive class. The comparative outcomes confirm the supremacy of the SDMIRP-SMODL model over existing techniques.
Funding (WLS-SVDD multimode process monitoring study): Project (61374140) supported by the National Natural Science Foundation of China.
Funding (WRF-Chem DA/MOS forecast study): supported by the State Key Research and Development Program (Grant Nos. 2017YFC0209803, 2016YFC0208504, 2016YFC0203303 and 2017YFC0210106) and the National Natural Science Foundation of China (Grant Nos. 91544230, 41575145, 41621005 and 41275128).
Funding (cryo-EM KaiC study): supported by the National Natural Science Foundation of China (Grant No. 12090054).
Funding (sketch-based animation description study): supported by the National Key Research and Development Plan (2016YFB1001200) and the National Natural Science Foundation of China (U1435220, 61232013).
Funding (DBSVDD process monitoring study): National Natural Science Foundation of China (No. 61374140) and the Youth Foundation of the National Natural Science Foundation of China (No. 61403072).
Funding (air quality evaluation study): supported by the National Key Research and Development Program of China (No. 2019YFC0214800).
Funding: Supported in part by NIH grants R01NS39600, U01MH114829, and RF1MH128693 (to GAA).
Abstract: Many fields, such as neuroscience, are experiencing the vast proliferation of cellular data, underscoring the need for organizing and interpreting large datasets. A popular approach partitions data into manageable subsets via hierarchical clustering, but objective methods to determine the appropriate classification granularity are missing. We recently introduced a technique to systematically identify when to stop subdividing clusters based on the fundamental principle that cells must differ more between than within clusters. Here we present the corresponding protocol to classify cellular datasets by combining data-driven unsupervised hierarchical clustering with statistical testing. These general-purpose functions are applicable to any cellular dataset that can be organized as two-dimensional matrices of numerical values, including molecular, physiological, and anatomical datasets. We demonstrate the protocol using cellular data from the Janelia MouseLight project to characterize morphological aspects of neurons.
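The stopping principle, that cells must differ more between than within clusters, can be sketched as a split-acceptance test. The code below is a hypothetical illustration using SciPy's hierarchical clustering and a Mann-Whitney test, not the authors' published functions, and a naive post-clustering test like this is only a sketch of the idea:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform
from scipy.stats import mannwhitneyu

def should_split(X, alpha=0.01):
    """Split X into two candidate clusters and accept the split only if
    between-cluster distances significantly exceed within-cluster ones."""
    labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")
    d = squareform(pdist(X))
    iu = np.triu_indices(len(X), k=1)              # count each pair once
    same = (labels[:, None] == labels[None, :])[iu]
    within, between = d[iu][same], d[iu][~same]
    # One-sided test: are between-cluster distances the larger ones?
    p = mannwhitneyu(between, within, alternative="greater").pvalue
    return p < alpha, labels

# Two well-separated synthetic "cell" groups in a 5-feature space.
rng = np.random.default_rng(0)
cells = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(8, 1, (30, 5))])
split, labels = should_split(cells)
print(split)  # True: the groups differ more between than within clusters
```

Applied recursively, a cluster is subdivided only while such a test keeps accepting the split, which is one way to fix the classification granularity objectively.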
基金support from the National Natural Science Foundation of China(NSFC,grant Nos.11973022 and 12373108)the Natural Science Foundation of Guangdong Province(No.2020A1515010710)Hanshan Normal University Startup Foundation for Doctor Scientific Research(No.QD202129)。
Abstract: Pulsar detection has recently become an active research topic in radio astronomy. One of the essential procedures in pulsar detection is pulsar candidate sifting (PCS), the identification of potential pulsar signals in a survey. Pulsar candidates are, however, always class-imbalanced: most candidates are non-pulsars such as RFI, and only a tiny fraction come from real pulsars. Class imbalance can greatly affect the performance of machine learning (ML) models, at a heavy cost when real pulsars are misjudged. To deal with this problem, we focus on feature selection: choosing relevant features that discriminate pulsars from non-pulsars. Feature selection is the process of selecting a subset of the most relevant features from a feature pool; features that distinguish pulsars from non-pulsars can significantly improve classifier performance even when the data are highly imbalanced. In this work, a feature selection algorithm called the K-fold Relief-Greedy (KFRG) algorithm is designed. KFRG is a two-stage algorithm: in the first stage, it filters out irrelevant features according to their K-fold Relief scores, while in the second stage, it removes redundant features and selects the most relevant ones by a forward greedy search strategy. Experiments on the data set of the High Time Resolution Universe survey verified that ML models based on KFRG are capable of PCS, correctly separating pulsars from non-pulsars even when the candidates are highly class-imbalanced.
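A rough sketch of the two-stage KFRG idea follows. It assumes a basic Relief scorer and a simple nearest-centroid accuracy for the greedy stage; the paper's actual scorer, classifier, and thresholds may differ, and the data here are synthetic:

```python
import numpy as np

def relief_scores(X, y):
    """Basic Relief: reward features that differ more from each sample's
    nearest miss (other class) than from its nearest hit (same class)."""
    n, m = X.shape
    scores = np.zeros(m)
    for i in range(n):
        d = np.abs(X - X[i]).sum(axis=1)   # L1 distance to every sample
        d[i] = np.inf                       # exclude the sample itself
        hit = np.argmin(np.where(y == y[i], d, np.inf))
        miss = np.argmin(np.where(y != y[i], d, np.inf))
        scores += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return scores / n

def kfold_relief(X, y, k=5, keep=0.5, seed=0):
    """Stage 1: average Relief scores over k folds; keep the top fraction."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    s = np.mean([relief_scores(X[f], y[f]) for f in folds], axis=0)
    return np.argsort(s)[::-1][: max(1, int(keep * X.shape[1]))]

def greedy_select(X, y, candidates, max_feats=3):
    """Stage 2: forward greedy search, adding the candidate feature that
    most improves a simple nearest-centroid accuracy."""
    def acc(feats):
        Z = X[:, feats]
        c0, c1 = Z[y == 0].mean(axis=0), Z[y == 1].mean(axis=0)
        pred = np.abs(Z - c1).sum(axis=1) < np.abs(Z - c0).sum(axis=1)
        return (pred.astype(int) == y).mean()
    chosen, best = [], 0.0
    for _ in range(max_feats):
        a, f = max((acc(chosen + [f]), f) for f in candidates if f not in chosen)
        if a <= best:
            break
        chosen.append(f)
        best = a
    return chosen

# Synthetic demo: features 0 and 1 carry the signal, the rest are noise.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)
X = rng.normal(0, 1, (200, 10))
X[:, 0] += 3 * y
X[:, 1] -= 3 * y
candidates = [int(i) for i in kfold_relief(X, y)]
print(greedy_select(X, y, candidates))
```

Stage 1 cheaply discards irrelevant features, and the greedy stage then drops redundant ones, since a feature correlated with an already-chosen one adds no accuracy and is never selected.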
Abstract: Data breaches have massive consequences for companies, affecting them financially and undermining their reputation, which poses significant challenges to online security and the long-term viability of businesses. This study analyzes trends in data breaches in the United States, examining the frequency, causes, and magnitude of breaches across various industries. We document that data breaches are increasing, with hacking emerging as the leading cause. Our descriptive analyses explore factors influencing breaches, including security vulnerabilities, human error, and malicious attacks. The findings provide policymakers and businesses with actionable insights to bolster data security through proactive audits, patching, encryption, and response planning. By better understanding breach patterns and risk factors, organizations can take targeted steps to enhance protections and mitigate the potential damage of future incidents.
Abstract: The founding conference of the Big Data Statistics Branch (BDSB) of the Chinese Association for Applied Statistics (CAAS) was held on 8 December 2018 at East China Normal University (ECNU), Shanghai, China. More than 600 experts and scholars attended the conference. Professor Zhang Riquan was elected as the chairman of the first Board of Directors of the BDSB. Fang Xiangzhong, Chairman of the CAAS, delivered a speech. Professor Wang Zhaojun and Dr Liu Zhong delivered keynote reports at the conference on the development of Big Data research and practice, respectively. The BDSB will be dedicated to building a high-level big data statistics exchange platform for experts and scholars in universities, governments, enterprises, and other fields, to better serve society and the country's major strategies.
Abstract: Human life would be impossible without breathable air. Consistent advancements in practically every aspect of contemporary human life have harmed air quality. Everyday industrial, transportation, and home activities release dangerous contaminants into our surroundings. This study investigated two years' worth of air quality and outlier detection data from two Indian cities. Studies on air pollution have used numerous methodologies, with the measured gases treated as a vector whose components are the gas concentration values for each observation performed. In our technique, we instead use curves to represent the monthly average of daily gas emissions. This approach, based on functional depth, was used to find outliers in the gas emissions of the cities of Delhi and Kolkata, and the outcomes were compared to those from the traditional method. In the evaluation and comparison of these models' performances, the functional approach model performed well.
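The functional-depth idea can be illustrated with the modified band depth: a curve that rarely lies inside the band spanned by pairs of other curves gets low depth and is flagged as an outlier. The sketch below uses synthetic monthly curves, not the Delhi/Kolkata data, and one common depth notion rather than the paper's exact choice:

```python
import numpy as np

def modified_band_depth(curves):
    """Modified band depth (J=2): for each curve, the average fraction of
    time points at which it lies inside the band of a pair of curves."""
    n, _ = curves.shape
    ranks = np.argsort(np.argsort(curves, axis=0), axis=0) + 1  # 1..n per time
    # At each time point, a curve of rank r lies inside (r-1)*(n-r) + (n-1)
    # of the n*(n-1)/2 bands formed by pairs of curves.
    inside = (ranks - 1) * (n - ranks) + (n - 1)
    return inside.mean(axis=1) / (n * (n - 1) / 2)

def depth_outliers(curves, frac=0.1):
    """Flag the given fraction of curves with the lowest functional depth."""
    depth = modified_band_depth(curves)
    k = max(1, int(frac * len(curves)))
    return np.argsort(depth)[:k]

# 20 ordinary monthly emission curves plus one strongly shifted curve.
t = np.linspace(0, 2 * np.pi, 12)                 # 12 monthly points
rng = np.random.default_rng(2)
normal = np.sin(t) + rng.normal(0, 0.1, (20, 12))
outlier = np.sin(t) + 3.0                         # shifted well above the rest
curves = np.vstack([normal, outlier])
print(depth_outliers(curves, frac=0.05))          # [20] -> the shifted curve
```

Because depth ranks whole curves rather than individual readings, a month that is extreme only relative to its own curve's history is not over-flagged, which is a key advantage over pointwise outlier rules.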
Funding: Supported by Agridata, a sub-program of the National Science and Technology Infrastructure Program (Grant No. 2005DKA31800).
Abstract: Feature representation is one of the key issues in data clustering. The existing feature representation of scientific data is not sufficient, which to some extent affects the results of scientific data clustering. This paper therefore proposes the concept of a composite text description (CTD) and a CTD-based feature representation method for biomedical scientific data. The method uses different feature weighting algorithms to represent candidate features based on two types of data sources, then combines and strengthens the two feature sets. Experiments show that the feature representation method is more effective than traditional methods and can significantly improve the performance of biomedical data clustering.
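One hedged reading of "combining and strengthening" two weighted feature sets is sketched below: compute TF-IDF weights separately per data source (say, titles and abstracts) and merge them with an extra weight on one source. The weighting scheme and the two-source split are illustrative assumptions, not the paper's CTD method:

```python
import math
from collections import Counter

def tfidf(docs):
    """Plain TF-IDF weights per document for one textual data source."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d.split()))
    weights = []
    for d in docs:
        tf = Counter(d.split())
        total = sum(tf.values())
        weights.append({w: (c / total) * math.log((1 + n) / (1 + df[w]))
                        for w, c in tf.items()})
    return weights

def combine(features_a, features_b, weight_b=2.0):
    """Merge two per-document feature sets, strengthening source B."""
    merged = []
    for fa, fb in zip(features_a, features_b):
        f = dict(fa)
        for w, v in fb.items():
            f[w] = f.get(w, 0.0) + weight_b * v
        merged.append(f)
    return merged

# Illustrative two-source input: titles and abstracts of two documents.
titles = ["gene clustering", "protein folding"]
abstracts = ["clustering of gene expression data", "folding of protein structures"]
merged = combine(tfidf(titles), tfidf(abstracts))
print(sorted(merged[0], key=merged[0].get, reverse=True)[:3])
```

Terms appearing in both sources accumulate weight from each, so the combined representation naturally emphasizes features that the two data sources agree on.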
基金This research was partly supported by the Technology Development Program of MSS[No.S3033853]by the National Research Foundation of Korea(NRF)grant funded by the Korea government(MSIT)(No.2021R1A4A1031509).
Abstract: Statistics are more crucial than ever due to the accessibility of huge amounts of data from several domains such as finance, medicine, science, and engineering. Statistical data mining (SDM) is an interdisciplinary domain that examines huge existing databases to discover patterns and connections in the data. It differs from classical statistics in the size of its datasets and in that the data typically were not gathered under an experimental design but for other purposes. This paper introduces an effective statistical data mining model for intelligent rainfall prediction using slime mould optimization with deep learning (SDMIRP-SMODL). In the presented SDMIRP-SMODL model, feature subset selection is performed by the SMO algorithm, which in turn minimizes the computational complexity. For rainfall prediction, a convolutional neural network with long short-term memory (CNN-LSTM) technique is exploited. Finally, the study employs the pelican optimization algorithm (POA) as a hyperparameter optimizer. The experimental evaluation of the SDMIRP-SMODL approach uses a rainfall dataset comprising 23,682 samples in the negative class and 1,865 samples in the positive class. The comparative outcomes show the supremacy of the SDMIRP-SMODL model over existing techniques.