Abstract: Clustering is one of the most widely used techniques for exploratory data analysis. Spectral clustering, a popular modern clustering algorithm, has been shown to be more effective at detecting clusters than many traditional algorithms. It has applications ranging from computer vision and information retrieval to social science and biology. With the size of databases soaring, clustering algorithms face scalability problems in computation time and memory use. In this paper, we propose a parallel spectral clustering implementation based on MapReduce. Both the computation and the data storage are distributed, which solves the scalability problems of most existing algorithms. We empirically analyze the proposed implementation on both benchmark networks and a real social network dataset of about two million vertices and two billion edges crawled from Sina Weibo. The results show that the proposed implementation scales well, speeds up the clustering without sacrificing quality, and processes massive datasets efficiently on commodity machine clusters.
Funding: Supported by the Fundamental Research Funds for the Central Universities of China (No. 2232020D-43).
Abstract: This paper studies inference for the index coefficient in single-index models with massive datasets. Analyzing massive datasets is challenging owing to formidable computational costs and memory requirements. A natural method is the averaging divide-and-conquer approach, which splits the data into several blocks, obtains an estimator for each block, and then aggregates the estimators by averaging. However, this approach restricts the number of blocks. To overcome this limitation, this paper proposes a computationally efficient method that requires only an initial estimator and then successively refines it through multiple rounds of aggregation. The proposed estimator achieves the optimal convergence rate without any restriction on the number of blocks. We present both theoretical analysis and experiments to explore the properties of the proposed method.
Funding: This paper was supported by the National Natural Science Foundation of China [grant number 11431006], [grant number 11690015], [grant number 11371202], [grant number 11622104].
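The averaging divide-and-conquer approach named in the abstract above (split the data into blocks, estimate on each block, average the block estimates) can be illustrated with a minimal sketch. This is not the paper's single-index estimator or its refinement procedure: as a stated assumption, a simple least-squares slope stands in for the per-block estimator, and the function names `block_slope` and `averaging_dc` and the simulated data are hypothetical, chosen only for illustration.

```python
import random
from statistics import fmean


def block_slope(xs, ys):
    """Least-squares slope of y on x for one block (stand-in estimator)."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def averaging_dc(xs, ys, n_blocks):
    """Split the sample into n_blocks, estimate per block, average the estimates."""
    n = len(xs)
    size = n // n_blocks
    estimates = []
    for b in range(n_blocks):
        start = b * size
        # The last block absorbs any leftover observations.
        stop = n if b == n_blocks - 1 else start + size
        estimates.append(block_slope(xs[start:stop], ys[start:stop]))
    return fmean(estimates)


# Simulated data (an assumption for illustration): y = 2x + small noise.
random.seed(0)
true_slope = 2.0
xs = [random.gauss(0.0, 1.0) for _ in range(10_000)]
ys = [true_slope * x + random.gauss(0.0, 0.1) for x in xs]

slope_dc = averaging_dc(xs, ys, n_blocks=20)   # divide-and-conquer estimate
slope_full = block_slope(xs, ys)               # estimate from the full sample
print(slope_dc, slope_full)
```

Each block needs only its own slice of the data in memory, which is what makes the approach attractive for massive datasets. For nonlinear estimators, per-block bias does not average away, which is one common reason the number of blocks has to be restricted, as the abstract notes.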
Abstract: New technological advancements, combined with powerful computer hardware and high-speed networks, make big data available. The massive sample size of big data introduces unique computational challenges for the scalability and storage of statistical methods. In this paper, we focus on the lack-of-fit test for parametric regression models in the big data setting. We develop a computationally feasible testing approach by integrating the divide-and-conquer algorithm into a powerful nonparametric test statistic. Our theoretical results show that, under mild conditions, the asymptotic null distribution of the proposed test is standard normal. Furthermore, the proposed test benefits from a data-driven bandwidth procedure and thus possesses a certain adaptive property. Simulation studies show that the proposed method performs satisfactorily, and it is illustrated with an analysis of an airline dataset.