When castings become complicated and the demands for precision of numerical simulation become higher, the numerical data of casting numerical simulation become more massive. On a general personal computer, these massive numerical data may exceed the capacity of available memory, resulting in failure of rendering. Based on the out-of-core technique, this paper proposes a method to effectively utilize external storage and dramatically reduce memory usage, so as to solve the problem of insufficient memory for massive data rendering on general personal computers. Based on this method, a new post-processor is developed. It can illustrate the filling and solidification processes of a casting, as well as thermal stress, and provides fast interaction with the simulation results. Theoretical analysis and several practical examples show that the memory usage and loading time of the post-processor do not depend on the total size of the result files, but only on the proportion of cells on the casting surface. Meanwhile, rendering and value picking at the mouse position are fast enough to satisfy the demands of real-time interaction.
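A minimal sketch of the out-of-core idea, assuming the simulation results are stored as a flat on-disk array: the full cell-value file is memory-mapped and only the cells on the casting surface are materialized in RAM, so memory usage scales with the surface-cell count rather than the file size. The file name, layout and function names below are hypothetical.

```python
import numpy as np

def load_surface_values(value_file, surface_ids, n_cells, dtype=np.float32):
    """Fetch values for surface cells from an on-disk array without
    reading the whole file into RAM."""
    # memmap opens a window onto the file; pages are read on demand
    all_values = np.memmap(value_file, dtype=dtype, mode="r", shape=(n_cells,))
    # fancy indexing copies only the requested surface cells into memory
    return np.array(all_values[surface_ids])

# Usage: memory cost is O(len(surface_ids)), independent of n_cells.
# surface = load_surface_values("temperature.bin", surface_ids, n_cells=50_000_000)
```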
In this paper, we consider unified optimal subsampling estimation and inference on the low-dimensional parameter of main interest in the presence of a nuisance parameter for low/high-dimensional generalized linear models (GLMs) with massive data. We first present a general subsampling decorrelated score function to reduce the influence of the less accurate nuisance parameter estimation with its slow convergence rate. The consistency and asymptotic normality of the resultant subsample estimator from a general decorrelated score subsampling algorithm are established, and two optimal subsampling probabilities are derived under the A- and L-optimality criteria to downsize the data volume and reduce the computational burden. The proposed optimal subsampling probabilities provably improve the asymptotic efficiency of the subsampling schemes in low-dimensional GLMs and perform better than the uniform subsampling scheme in high-dimensional GLMs. A two-step algorithm is further proposed for implementation, and the asymptotic properties of the corresponding estimators are also given. Simulations show satisfactory performance of the proposed estimators, and two applications to the census income and Fashion-MNIST datasets also demonstrate their practical applicability.
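For intuition, here is a hedged sketch of A-optimality subsampling for a plain logistic regression (the well-known OSMAC-style weights); the paper's decorrelated-score version adds a nuisance-parameter correction that is omitted here, and `beta_pilot` is assumed to come from a small uniform pilot sample.

```python
import numpy as np

def a_optimal_probs(X, y, beta_pilot):
    """Subsampling probabilities p_i proportional to |y_i - pi_i| * ||M^-1 x_i||."""
    pi = 1.0 / (1.0 + np.exp(-X @ beta_pilot))    # pilot fitted probabilities
    w = pi * (1.0 - pi)                           # GLM variance weights
    M = (X * w[:, None]).T @ X / len(y)           # observed information matrix
    scores = np.abs(y - pi) * np.linalg.norm(X @ np.linalg.inv(M), axis=1)
    return scores / scores.sum()

def weighted_subsample(X, y, probs, r, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=r, replace=True, p=probs)
    weights = 1.0 / (r * probs[idx])              # inverse-probability weights keep
    return X[idx], y[idx], weights                # the subsample estimator unbiased
```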
Nowadays, researchers are frequently confronted with challenges in massive data computing caused by the limitations of computer primary memory. Modal regression (MR) is a good alternative to mean regression and likelihood-based methods because of its robustness and high efficiency. To this end, the authors extend MR to massive data analysis and propose a computationally and statistically efficient divide-and-conquer MR method (DC-MR). The major novelty of this method consists of splitting one entire dataset into several blocks, implementing the MR method on the data in each block, and deriving final results by combining these regression results via a weighted average, which provides approximate estimates of the regression results on the entire dataset. The proposed method significantly reduces the required amount of primary memory, and the resulting estimator is theoretically as efficient as traditional MR on the entire dataset. The authors also investigate a multiple-hypothesis-testing variable selection approach to select significant parametric components and prove that the approach possesses the oracle property. In addition, the authors propose a practical modified modal expectation-maximization (MEM) algorithm for the proposed procedures. Numerical studies on simulated and real datasets are conducted to assess and showcase the practical and effective performance of the proposed methods.
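A minimal sketch of the divide-and-conquer pipeline, assuming a modal linear regression fitted by a standard MEM-type loop (Gaussian-kernel reweighted least squares) on each block; the combination weights below are simply the block sizes, a simplification of the paper's weighted average.

```python
import numpy as np

def modal_regression_mem(X, y, h=1.0, iters=50):
    """Modal linear regression via MEM: iteratively reweighted least squares
    with Gaussian kernel weights phi_h(y - X beta)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]     # least-squares start
    for _ in range(iters):
        r = y - X @ beta
        w = np.exp(-0.5 * (r / h) ** 2)             # E-step: kernel weights
        Xw = X * w[:, None]
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)  # M-step: weighted LS
    return beta

def dc_mr(blocks, h=1.0):
    """Fit each block separately, then combine by a size-weighted average."""
    betas, sizes = [], []
    for X, y in blocks:
        betas.append(modal_regression_mem(X, y, h))
        sizes.append(len(y))
    w = np.array(sizes, float) / sum(sizes)
    return (np.array(betas) * w[:, None]).sum(axis=0)
```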
Because the memory of current wearable devices is limited while the amount of information keeps growing, the processing capacity of the servers in the storage system cannot keep up with the speed of information growth, resulting in poor load balancing, long load-balancing time and data processing delays. Therefore, this paper applies a data load balancing technology to the massive storage systems of wearable devices. We first analyze the object-oriented load balancing method and formally describe the dynamic load balancing issue, treating load balancing as a mapping problem. Then, tasks and requests are assigned to each data node according to the node's actual processing capacity: different data are allocated to the corresponding data storage nodes, and the comprehensive weight of each data storage node is computed. According to the load information of each data storage node collected by the scheduler in the storage system, the load weight of the current data storage node is calculated and requests are distributed accordingly. In this way, data load balancing of the massive storage system for wearable devices is realized. The experimental results show that the average load-balancing time using this method is 1.75 h, much lower than that of traditional methods. The results show that this data load balancing technology offers short load-balancing time, a high degree of balance, strong data processing capability, short processing time and clear practical value.
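A hedged sketch of the weighted dispatch step: each storage node receives a comprehensive weight from its resource indicators, and each request goes to the node with the lowest load per unit of weight. The indicator names and the 0.5/0.3/0.2 mix are illustrative placeholders, not the paper's calibration.

```python
from dataclasses import dataclass

@dataclass
class StorageNode:
    name: str
    cpu: float      # relative CPU capacity
    mem: float      # relative memory capacity
    io: float       # relative disk/network bandwidth
    load: int = 0   # requests currently assigned

    @property
    def weight(self):
        # comprehensive weight: a fixed mix of the resource indicators
        return 0.5 * self.cpu + 0.3 * self.mem + 0.2 * self.io

def dispatch(nodes, n_requests):
    for _ in range(n_requests):
        # pick the node whose current load is smallest per unit of weight
        target = min(nodes, key=lambda n: n.load / n.weight)
        target.load += 1
    return {n.name: n.load for n in nodes}

nodes = [StorageNode("n1", 1.0, 1.0, 1.0), StorageNode("n2", 2.0, 1.5, 1.0)]
print(dispatch(nodes, 1000))   # the stronger node absorbs more requests
```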
With increasingly complex website structures and continuously advancing web technologies, accurate recognition of user clicks from massive HTTP data, which is critical for web usage mining, becomes more difficult. In this paper, we propose a dependency graph model to describe the relationships between web requests. Based on this model, we design and implement a heuristic parallel algorithm to distinguish user clicks with the assistance of cloud computing technology. We evaluate the proposed algorithm on real massive data. The dataset, collected from a mobile core network, is 228.7 GB in size and covers more than three million users. The experimental results demonstrate that the proposed algorithm achieves higher accuracy than previous methods.
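A much-simplified sketch of the dependency idea, assuming requests carry Referer headers: a request is treated as a user click unless its referrer points to a page fetched shortly before, in which case it is an embedded/automatic request. The 2-second threshold and the single-pass structure are illustrative; the paper's graph model and parallelization are richer.

```python
def classify_clicks(requests, window=2.0):
    """requests: list of (timestamp, url, referer) sorted by timestamp.
    Returns the requests judged to be user clicks."""
    last_seen = {}     # url -> time it was last fetched
    clicks = []
    for ts, url, referer in requests:
        parent_ts = last_seen.get(referer)
        if referer is None or parent_ts is None or ts - parent_ts > window:
            clicks.append((ts, url))        # no recent parent edge: a user click
        last_seen[url] = ts                 # this url may parent later requests
    return clicks
```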
Published auxiliary information can be helpful for conducting statistical inference in a new study. In this paper, we synthesize auxiliary information with semiparametric likelihood-based inference for censored data when the total sample size is available. We express the auxiliary information as constraints on the regression coefficients and the covariate distribution, and then use the empirical likelihood method for general estimating equations to improve the estimation efficiency of the parameters of interest in the specified model. The consistency and asymptotic normality of the resulting regression parameter estimators are established. Numerical simulations and an application under different assumed conditions show that the proposed method yields a substantial gain in the efficiency of the parameters of interest.
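A minimal sketch of the empirical-likelihood mechanics used to inject moment-type auxiliary information: given constraint values g(z_i) with population mean zero, the EL weights take the form p_i = 1/(n(1 + λᵀg_i)), with λ solved by Newton's method on the dual equation. Censoring and the paper's semiparametric likelihood are omitted, and the unguarded Newton loop is a simplification.

```python
import numpy as np

def el_weights(G, iters=25):
    """G: (n, q) matrix of constraint values g(z_i). Returns EL weights p_i."""
    n, q = G.shape
    lam = np.zeros(q)
    for _ in range(iters):                  # Newton on f(lam) = sum_i g_i/(1+lam'g_i)
        denom = 1.0 + G @ lam
        Gd = G / denom[:, None]
        grad = Gd.sum(axis=0)               # dual gradient
        hess = -Gd.T @ Gd                   # dual Jacobian
        lam -= np.linalg.solve(hess, grad)  # (no step-halving safeguard here)
    return 1.0 / (n * (1.0 + G @ lam))
```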
A covert transmission scheme for massive data based on the Shamir threshold scheme is proposed in this paper. The method applies the Shamir threshold scheme to divide the data, uses information hiding technology to conceal the shadows, and realizes covert transmission of massive data by transmitting stego-covers. Analysis proves that, compared with the natural division method, this scheme not only improves transmission time-efficiency but also enhances security.
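A sketch of the (t, n) Shamir threshold splitting the scheme builds on, over a prime field: a data block is the constant term of a random degree-(t−1) polynomial, each shadow is one evaluation, and any t shadows reconstruct the block by Lagrange interpolation. The prime and block encoding are illustrative.

```python
import random

P = 2**127 - 1                       # a Mersenne prime large enough for a block

def split(secret, t, n):
    """Return n shadows of which any t reconstruct the secret."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    def f(x):
        return sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over GF(P)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P  # Fermat inverse
    return secret

shares = split(123456789, t=3, n=5)
assert reconstruct(shares[:3]) == 123456789
```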
In this paper, we study large-scale inference for a linear expectile regression model. To mitigate the computational challenges of the classical asymmetric least squares (ALS) estimation under massive data, we propose a communication-efficient divide-and-conquer algorithm to combine the information from sub-machines through confidence distributions. The resulting pooled estimator has a closed-form expression, and its consistency and asymptotic normality are established under mild conditions. Moreover, we derive the Bahadur representation of the ALS estimator, which serves as an important tool to study the relationship between the number of sub-machines K and the sample size. Numerical studies including both synthetic and real data examples are presented to illustrate the finite-sample performance of our method and support the theoretical results.
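A hedged sketch of the two ingredients: ALS (expectile) regression fitted by iteratively reweighted least squares on each sub-machine, then a pooled estimate across the K machines. The plain average below is a simplification of the paper's confidence-distribution pooling.

```python
import numpy as np

def als_expectile(X, y, tau=0.8, iters=100):
    """Asymmetric least squares: weight tau above the fit, 1 - tau below."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(iters):
        r = y - X @ beta
        w = np.where(r >= 0, tau, 1 - tau)          # asymmetric weights
        Xw = X * w[:, None]
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)  # weighted normal equations
    return beta

def pooled_estimate(blocks, tau=0.8):
    """One local fit per sub-machine, then a simple average."""
    return np.mean([als_expectile(X, y, tau) for X, y in blocks], axis=0)
```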
How to mine valuable information from massive multi-source heterogeneous data and identify the intention of aerial targets is a major research focus at present. To address the long-term dependence in air target intention recognition, this paper explores the potential attribute features in the spatiotemporal sequence data of the target. First, we build an intelligent dynamic intention recognition framework comprising a series of specific stages: data source, data preprocessing, target space-time features, a convolutional neural network-bidirectional gated recurrent unit-attention (CBA) model, and intention recognition. Then, we analyze and explain the designed CBA model in detail. Finally, comparison with experiments on other recognition models shows that our proposed method can effectively improve the accuracy of air target intention recognition, which is of significance for commanders' operational command and situation prediction.
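An illustrative PyTorch sketch of a CNN + bidirectional GRU + attention classifier of the kind the CBA name suggests; the layer sizes, feature count and number of intention classes are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CBA(nn.Module):
    def __init__(self, n_features=8, n_classes=6, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(n_features, 32, kernel_size=3, padding=1)
        self.gru = nn.GRU(32, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)      # attention scorer
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                          # x: (batch, time, features)
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.gru(h)                         # (batch, time, 2*hidden)
        a = torch.softmax(self.score(h), dim=1)    # attention over time steps
        ctx = (a * h).sum(dim=1)                   # weighted context vector
        return self.out(ctx)

model = CBA()
logits = model(torch.randn(4, 20, 8))              # 4 tracks, 20 time steps each
```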
This paper designs and develops a framework on a distributed computing platform for massive multi-source spatial data using a column-oriented database (HBase). The platform consists of four layers: an ETL (extraction-transformation-loading) tier, a data processing tier, a data storage tier and a data display tier, achieving long-term storage, real-time analysis and querying of massive data. Finally, a real dataset cluster is simulated, made up of 39 nodes including 2 master nodes and 37 data nodes; function tests of the data importing and real-time query modules are performed, together with performance tests of HDFS I/O, the MapReduce cluster, and batch loading and real-time querying of massive data. The test results indicate that the platform achieves high performance in terms of response time and linear scalability.
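A hedged sketch of the storage-tier access pattern using the happybase HBase client (assumed available via a Thrift gateway): spatial records are keyed by a spatial-prefix-plus-timestamp row key, so that a region query becomes a contiguous row scan. The host, table and column names are placeholders, not the paper's schema.

```python
import happybase

conn = happybase.Connection("hbase-master")       # Thrift gateway, assumed
table = conn.table("spatial_data")

def put_record(cell_code, ts, payload):
    # row key = spatial prefix + timestamp: nearby records sort together,
    # which is what makes region queries cheap in a column store
    table.put(f"{cell_code}:{ts:013d}", {b"d:payload": payload})  # payload: bytes

def scan_region(cell_code):
    # a prefix scan retrieves every record in the spatial cell
    return [(k, v) for k, v in table.scan(row_prefix=cell_code.encode())]
```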
In this paper, we consider distributed inference for heterogeneous linear models with massive datasets. Noting that heterogeneity may exist not only in the expectations of the subpopulations, but also in their variances, we propose the heteroscedasticity-adaptive distributed aggregation (HADA) estimation, which is shown to be communication-efficient and asymptotically optimal regardless of homoscedasticity or heteroscedasticity. Furthermore, a distributed test for parameter heterogeneity across subpopulations is constructed based on the HADA estimator. The finite-sample performance of the proposed methods is evaluated using simulation studies and the NYC flight data.
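For intuition, a minimal sketch of variance-adaptive aggregation: each machine sends its local OLS estimate together with a heteroscedasticity-robust (sandwich) covariance, and the center combines them by inverse-variance weighting. This is the generic precision-weighted scheme, a simplification of the HADA estimator rather than the paper's exact construction.

```python
import numpy as np

def local_summary(X, y):
    """One machine's contribution: OLS estimate + Eicker-White covariance."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    r = y - X @ beta
    meat = X.T @ (X * (r ** 2)[:, None])      # sandwich middle term
    V = XtX_inv @ meat @ XtX_inv              # robust covariance of beta
    return beta, V

def aggregate(summaries):
    """Precision-weighted combination of the local estimates."""
    P = sum(np.linalg.inv(V) for _, V in summaries)          # total precision
    b = sum(np.linalg.inv(V) @ beta for beta, V in summaries)
    return np.linalg.solve(P, b)
```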
To make three-dimensional electromagnetic exploration achievable, the distributed wide field electromagnetic method (WFEM) based on a high-order 2^(n)-sequence pseudo-random signal is proposed and realized. In this method, only one set of high-order pseudo-random waveforms, which contains all target frequencies, is needed. Based on the high-order-sequence pseudo-random signal construction algorithm, the waveform can be customized for different exploration tasks. The receivers are independent of each other and dynamically adjust the acquisition parameters according to different requirements. A field test in the deep iron ore of Qihe−Yucheng showed that the distributed WFEM based on the high-order pseudo-random signal realizes high-efficiency acquisition of massive electromagnetic data in quite a short time. Compared with traditional controlled-source electromagnetic methods, the distributed WFEM is much more efficient. Distributed WFEM can be applied to large-scale, high-resolution exploration for deep resources and minerals.
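An illustrative construction in the spirit of a 2^(n)-sequence waveform: superpose square waves at n octave-spaced target frequencies f0·2^k and keep only the sign, giving a single binary drive waveform that carries energy at all target frequencies at once. This mirrors the idea only; the actual WFEM construction algorithm is more elaborate, and all parameters here are placeholders.

```python
import numpy as np

def pseudo_random_waveform(f0=1.0, n=7, fs=1000.0, duration=4.0):
    """Binary +/-1 waveform containing the n target frequencies f0 * 2^k."""
    t = np.arange(0.0, duration, 1.0 / fs)
    # sum of square waves at octave-spaced frequencies (odd n: the sum is never 0)
    square_sum = sum(np.sign(np.sin(2 * np.pi * f0 * 2**k * t)) for k in range(n))
    return t, np.sign(square_sum)

t, s = pseudo_random_waveform()   # one transmitted waveform, all frequencies
```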
The practical application of 3D inversion of gravity data requires a lot of computation time and storage space. To solve this problem, we present an integrated optimization algorithm with the following components: (1) targeting high accuracy in the space domain and fast computation in the wavenumber domain, we design a fast 3D forward algorithm with high precision; and (2) taking advantage of the symmetry of the inversion matrix, the main calculation in gravity conjugate-gradient inversion is decomposed into two forward calculations, thus optimizing the computational efficiency of 3D gravity inversion. We verify the calculation accuracy and efficiency of the optimization algorithm by testing models with various grid numbers in numerical simulation experiments.
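A schematic CGLS loop illustrating point (2): each conjugate-gradient least-squares iteration needs exactly two operator applications, one forward product G·v and one adjoint product Gᵀ·v, which in the paper's setting are both realized as fast wavenumber-domain forward computations. The callbacks below stand in for those; e.g. `forward = lambda v: G @ v` and `adjoint = lambda v: G.T @ v` for a dense test matrix.

```python
import numpy as np

def cgls(forward, adjoint, d, n, iters=50):
    """Minimize ||G m - d||^2 with two operator calls per iteration."""
    m = np.zeros(n)
    r = d.copy()                   # residual d - G m
    s = adjoint(r)                 # gradient direction
    p, gamma = s.copy(), s @ s
    for _ in range(iters):
        q = forward(p)             # forward call no. 1
        alpha = gamma / (q @ q)
        m += alpha * p
        r -= alpha * q
        s = adjoint(r)             # forward-type call no. 2
        gamma_new = s @ s
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return m
```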
With user-generated content, anyone can be a content creator. This phenomenon has vastly increased the amount of information circulated online, and it is becoming harder to obtain required information efficiently. In this paper, we describe how natural language processing and text mining can be parallelized using Hadoop and the Message Passing Interface. We propose a parallel web text mining platform that processes massive amounts of data quickly and efficiently. Our web knowledge service platform is designed to collect information about the IT and telecommunications industries from the web and process this information using natural language processing and data-mining techniques.
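A toy illustration of the map/reduce pattern such parallel text mining relies on: term counting across documents, with a local process pool standing in for the Hadoop/MPI cluster.

```python
from collections import Counter
from multiprocessing import Pool

def map_terms(doc):
    # map phase: one document -> its term counts
    return Counter(doc.lower().split())

def mine(docs, workers=4):
    with Pool(workers) as pool:
        partials = pool.map(map_terms, docs)   # map tasks run in parallel
    total = Counter()
    for c in partials:                         # reduce phase: merge the counts
        total.update(c)
    return total

if __name__ == "__main__":
    print(mine(["big data mining", "web data mining"]).most_common(3))
```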
Because traditional methods have difficulty discovering the internal relationships and association rules of data when dealing with massive data, a fuzzy clustering method is proposed to analyze massive data. First, the sample matrix is normalized. Second, a fuzzy equivalence matrix is constructed from the normalized matrix using the fuzzy clustering method, and this equivalence matrix serves as the basis for dynamic clustering. Finally, a series of classifications is carried out on the massive data at successive cut-set levels and a dynamic cluster diagram is generated. The experimental results show that the fuzzy clustering method can effectively identify association rules in datasets through multiple iterations over massive data, and that the clustering process has a short running time and good robustness. Therefore, it can be widely applied to the identification and classification of association rules in massive data such as sound, images and natural resources.
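A sketch of the classical fuzzy-equivalence clustering pipeline the abstract describes: normalize, build a fuzzy similarity matrix, take its transitive closure by repeated max-min composition, then cut at a level λ to read off the clusters. The distance-based similarity and λ = 0.85 are illustrative choices.

```python
import numpy as np

def maxmin(A):
    # max-min self-composition: (A o A)[i, j] = max_k min(A[i, k], A[k, j])
    return np.max(np.minimum(A[:, :, None], A[None, :, :]), axis=1)

def fuzzy_clusters(samples, lam=0.85):
    span = samples.max(0) - samples.min(0)
    X = (samples - samples.min(0)) / (span + 1e-12)   # normalize each column
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    R = 1.0 - D / (D.max() + 1e-12)                   # fuzzy similarity matrix
    while True:                                       # transitive closure
        R2 = maxmin(R)
        if np.allclose(R2, R):
            break
        R = R2
    cut = R >= lam                 # lambda-cut: a crisp equivalence relation
    labels, seen = [], {}
    for row in map(tuple, cut):    # identical rows = same equivalence class
        labels.append(seen.setdefault(row, len(seen)))
    return np.array(labels)
```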
Outlier detection is a very important type of data mining that is extensively used in many application areas. The traditional cell-based outlier detection algorithm not only takes a large amount of time to process massive data, but also uses many machine resources, resulting in an imbalanced machine load. This paper presents a MapReduce-based, cell-based outlier detection algorithm, combined with a single-layer perceptron, which parallelizes outlier detection. Experiments show that this improved algorithm can effectively improve both the efficiency and the accuracy of outlier detection.
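A toy sketch of the cell-based idea in map/reduce shape: the map phase assigns each 2D point to a grid cell, the reduce phase counts points per cell, and points whose cell plus immediate neighbours hold fewer than `min_pts` points are flagged as outlier candidates. Real deployments run the two phases as Hadoop jobs; the cell width and threshold here are illustrative, and the perceptron refinement is omitted.

```python
from collections import Counter
from itertools import product

def map_phase(points, w):
    # emit (cell, point) pairs; cell = floor-divided coordinates
    return [((int(x // w), int(y // w)), (x, y)) for x, y in points]

def reduce_counts(pairs):
    # reduce phase: points per cell
    return Counter(cell for cell, _ in pairs)

def detect_outliers(points, w=1.0, min_pts=4):
    pairs = map_phase(points, w)
    counts = reduce_counts(pairs)
    outliers = []
    for (cx, cy), p in pairs:
        # population of the 3x3 block of cells around the point's cell
        neighbourhood = sum(counts[(cx + dx, cy + dy)]
                            for dx, dy in product((-1, 0, 1), repeat=2))
        if neighbourhood < min_pts:
            outliers.append(p)
    return outliers
```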