3 articles found
Parallel Spectral Clustering Based on MapReduce (cited by: 3)
1
Authors: Qiwei Zhong, Yunlong Lin, Junyang Zou, Kuangyan Zhu, Qiao Wang, Lei Hu. ZTE Communications, 2013, No. 2, pp. 45-50 (6 pages)
Clustering is one of the most widely used techniques for exploratory data analysis. The spectral clustering algorithm, a popular modern clustering algorithm, has been shown to be more effective in detecting clusters than many traditional algorithms. It has applications ranging from computer vision and information retrieval to social science and biology. With the size of databases soaring, clustering algorithms have scaling computational time and memory use. In this paper, we propose a parallel spectral clustering implementation based on MapReduce. Both the computation and data storage are distributed, which solves the scalability problems for most existing algorithms. We empirically analyze the proposed implementation on both benchmark networks and a real social network dataset of about two million vertices and two billion edges crawled from Sina Weibo. It is shown that the proposed implementation scales well, speeds up the clustering without sacrificing quality, and processes massive datasets efficiently on commodity machine clusters.
Keywords: spectral clustering; parallel implementation; massive dataset; Hadoop MapReduce; data mining
A short note on fitting a single-index model with massive data
2
Authors: Rong Jiang, Yexun Peng. Statistical Theory and Related Fields (CSCD), 2023, No. 1, pp. 49-60 (12 pages)
This paper studies the inference problem for the index coefficient in single-index models with massive datasets. Analysis of massive datasets is challenging owing to formidable computational costs or memory requirements. A natural method is the averaging divide-and-conquer approach, which splits the data into several blocks, obtains an estimator for each block, and then aggregates the estimators via averaging. However, there is a restriction on the number of blocks. To overcome this limitation, this paper proposes a computationally efficient method, which only requires an initial estimator and then successively refines the estimator via multiple rounds of aggregation. The proposed estimator achieves the optimal convergence rate without any restriction on the number of blocks. We present both theoretical analysis and experiments to explore the properties of the proposed method.
Keywords: single-index model; massive dataset; divide-and-conquer method
An adaptive lack of fit test for big data
3
Authors: Yanyan Zhao, Changliang Zou, Zhaojun Wang. Statistical Theory and Related Fields, 2017, No. 1, pp. 59-68 (10 pages)
New technological advancements combined with powerful computer hardware and high-speed networks make big data available. The massive sample size of big data introduces unique computational challenges for the scalability and storage of statistical methods. In this paper, we focus on the lack-of-fit test of parametric regression models under the framework of big data. We develop a computationally feasible testing approach by integrating the divide-and-conquer algorithm into a powerful nonparametric test statistic. Our theoretical results show that, under mild conditions, the asymptotic null distribution of the proposed test is standard normal. Furthermore, the proposed test benefits from the use of a data-driven bandwidth procedure and thus possesses a certain adaptive property. Simulation studies show that the proposed method has satisfactory performance, and it is illustrated with an analysis of an airline dataset.
Keywords: adaptive test; asymptotic distribution; divide-and-conquer algorithm; massive dataset; model specification test