Funding: The National Natural Science Foundation of China (No. 60973023, 60603040); the Natural Science Foundation of Southeast University (No. KJ2009362)
Abstract: To avoid redundant and inconsistent information in distributed data streams, a sampling method based on min-wise hash functions is designed, and practical semantics for the union of distributed data streams are defined. First, for each hash function in a family of min-wise hash functions, the item with the minimum hash value is selected as the local sample, which filters out the bias caused by frequent updates within a single data stream. Second, the local samples are combined at the center node, and for each hash function the sample with the minimum hash value is selected as the global sample, which filters out the bias caused by duplicated updates. Finally, based on the resulting uniform samples, several aggregations over the defined semantics of the union of data streams are estimated precisely. Comparison tests on synthetic and real-life data streams demonstrate the effectiveness of the method.
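The two sampling steps described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the seeded BLAKE2b hash stands in for a true min-wise hash family, and the names `h`, `local_samples`, and `global_samples` are hypothetical.

```python
import hashlib


def h(seed: int, item: str) -> int:
    """Seeded 64-bit hash; an assumed stand-in for one min-wise hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def local_samples(stream, num_hashes):
    """Per hash function, keep the item with the smallest hash value.

    Repeated updates of the same item hash to the same value, so an item's
    frequency within one stream has no effect on the result.
    Assumes the stream is non-empty.
    """
    best = [None] * num_hashes
    for item in stream:
        for k in range(num_hashes):
            candidate = (h(k, item), item)
            if best[k] is None or candidate < best[k]:
                best[k] = candidate
    return best


def global_samples(per_node):
    """At the center node, keep the minimum per hash function across nodes.

    An item seen at several nodes yields identical candidates, so duplicates
    across streams collapse into one.
    """
    return [min(candidates) for candidates in zip(*per_node)]


# Two nodes observe overlapping streams with repeated updates.
node_a = local_samples(["x", "y", "y", "y"], num_hashes=4)
node_b = local_samples(["y", "z"], num_hashes=4)
merged = global_samples([node_a, node_b])
```

Because the minimum over a union equals the minimum of the per-stream minima, `merged` matches what a single node would compute over the distinct items of both streams.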
Funding: Projects (60873265, 60903222) supported by the National Natural Science Foundation of China; Project (IRT0661) supported by the Program for Changjiang Scholars and Innovative Research Team in University of China
Abstract: The Circle algorithm is proposed for large datasets. The idea of the algorithm is to find a set of vertices that are close to each other and far from the other vertices. The algorithm exploits the connection between clustering aggregation and the correlation clustering problem. The best deterministic approximation algorithm is provided for a variation of the correlation clustering problem, and sampling is shown to scale the algorithms to large datasets. An extensive empirical evaluation demonstrates the usefulness of the problem and of the solutions. The results show that the method reduces the running time by more than 50% without sacrificing the quality of the clustering.
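The link between clustering aggregation and correlation clustering can be illustrated with a small sketch. It assumes a common formulation that the abstract does not spell out: the distance between two objects is the fraction of input clusterings that separate them. The functions `disagreement_distances` and `close_set` and the `radius` threshold are hypothetical; this is not the Circle algorithm itself.

```python
from itertools import combinations


def disagreement_distances(clusterings):
    """Pairwise distance = fraction of input clusterings that separate u and v.

    Each clustering is a list of cluster labels, one label per object.
    """
    n = len(clusterings[0])
    m = len(clusterings)
    dist = [[0.0] * n for _ in range(n)]
    for u, v in combinations(range(n), 2):
        apart = sum(1 for labels in clusterings if labels[u] != labels[v])
        dist[u][v] = dist[v][u] = apart / m
    return dist


def close_set(dist, u, radius=0.5):
    """Vertices closer to u than the (assumed) radius, including u itself."""
    return [v for v in range(len(dist)) if dist[u][v] < radius]


# Three input clusterings of four objects.
inputs = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
dist = disagreement_distances(inputs)
group = close_set(dist, 0)  # objects that most clusterings place with object 0
```

With a radius of 0.5, an object joins the group only when a majority of the input clusterings agree that it belongs with object 0.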
Funding: Project (2013DFB10070) supported by the International Science & Technology Cooperation Program of China; Project (2012GK4106) supported by the Hunan Provincial Science & Technology Program, China; Project (12MX15) supported by the Mittal Innovation Project of Central South University, China
Abstract: Privacy is becoming one of the most notable challenges threatening wireless sensor networks (WSNs). Adversaries may use radio frequency (RF) localization techniques to trace back hop by hop to the location of the source sensor. A multiple k-hop clusters based routing strategy (MHCR) is proposed to preserve source-location privacy and to enhance energy efficiency in WSNs. Owing to the inherent characteristics of intra-cluster data aggregation, each sensor in the interference clusters can act as a fake source to confuse the adversary. Moreover, the cluster heads can filter dummy traffic efficiently during data aggregation, so that no additional energy consumption burdens the hotspot of the network. Through careful analysis and calculation of the distribution and number of interference clusters, energy efficiency is significantly enhanced without reducing the network lifetime. Finally, the security and delay performance of the MHCR scheme are analyzed theoretically. Extensive analysis and simulation results demonstrate that the MHCR scheme markedly improves both location privacy and energy efficiency, especially in large-scale WSNs.
Funding: Project (60835005) supported by the National Natural Science Foundation of China
Abstract: Clustering high-dimensional data, with its inherent sparsity and the presence of noise, is a serious challenge for clustering algorithms. A new linear manifold clustering method is proposed to address this problem. The basic idea is to search for the line manifold clusters hidden in a dataset and then fuse some of them to construct higher-dimensional manifold clusters. The orthogonal distance and the tangent distance are considered together as the linear manifold distance metrics. Spatial neighbor information is fully utilized to construct the initial line manifolds and to optimize them during the line manifold cluster search. Experiments on real and synthetic datasets demonstrate the superiority of the proposed method over some competing clustering methods in terms of accuracy and computation time. The proposed method obtains high clustering accuracy on datasets of different sizes, manifold dimensions, and noise ratios, which confirms its noise resistance and its accuracy on high-dimensional data.
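A point's distance to a line manifold can be split into two components, sketched below. The orthogonal component is the standard point-to-line distance. Treating the along-line offset from an anchor point as the tangential component is an assumption, since the abstract does not define its tangent distance; `line_distances` and its parameters are hypothetical names.

```python
import math


def line_distances(x, p, d):
    """Split the offset x - p into parts orthogonal and parallel to a line.

    The line passes through the anchor point p with direction d (d must be
    non-zero). Returns (orthogonal_distance, along_line_offset). Using the
    along-line offset as a tangential measure is an assumption made here.
    """
    norm = math.sqrt(sum(c * c for c in d))
    unit = [c / norm for c in d]
    diff = [xi - pi for xi, pi in zip(x, p)]
    along = sum(a * b for a, b in zip(diff, unit))
    residual = [a - along * u for a, u in zip(diff, unit)]
    orthogonal = math.sqrt(sum(r * r for r in residual))
    return orthogonal, abs(along)


# A point measured against the x-axis in three dimensions.
orth, along = line_distances(x=(3.0, 4.0, 0.0), p=(0.0, 0.0, 0.0), d=(2.0, 0.0, 0.0))
```

A point with a small orthogonal distance lies near the line; the along-line offset indicates how far it sits from the anchor, which helps separate distinct clusters that happen to share a direction.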
Funding: Supported by the National Natural Science Foundation of China (No. 60202004).
Abstract: Among the clustering algorithms available in data mining, the CLOPE algorithm has attracted considerable attention for its high speed and good performance. However, the choice of some parameters in the CLOPE algorithm directly affects the validity of the clustering results, and choosing them properly is still an open issue. This paper therefore proposes a fuzzy CLOPE algorithm and presents a method for optimal parameter choice that defines a modified partition fuzzy degree as a clustering validity function. Experimental results on a real dataset illustrate the effectiveness of the proposed fuzzy CLOPE algorithm and of the optimal parameter choice method based on the modified partition fuzzy degree.
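To show why parameter choice matters, the sketch below evaluates the profit criterion of the original (crisp) CLOPE algorithm as it is commonly stated, with the repulsion parameter `r`. It does not implement the fuzzy variant or the modified partition fuzzy degree, which the abstract does not specify; `clope_profit` and the sample transactions are illustrative assumptions.

```python
def clope_profit(clusters, r):
    """Profit of a clustering under the crisp CLOPE criterion.

    clusters: list of clusters, each a list of transactions (sets of items).
    For each cluster, S is the total number of item occurrences, W is the
    number of distinct items, and N is the number of transactions; the
    profit is sum(S / W**r * N) divided by the total transaction count.
    """
    weighted = 0.0
    total = 0
    for cluster in clusters:
        size = sum(len(t) for t in cluster)       # S: item occurrences
        width = len(set().union(*cluster))        # W: distinct items
        weighted += size / width ** r * len(cluster)
        total += len(cluster)
    return weighted / total


t1, t2, t3, t4, t5 = {"a", "b"}, {"a", "b", "c"}, {"a", "c", "d"}, {"d", "e"}, {"d", "e", "f"}
grouping_1 = [[t1, t2, t3], [t4, t5]]
grouping_2 = [[t1, t2], [t3, t4, t5]]

profit_1 = clope_profit(grouping_1, r=2)
profit_2 = clope_profit(grouping_2, r=2)
```

With `r=2`, the first grouping scores higher because its clusters share more items relative to their width. Changing `r` changes how strongly wide clusters are penalized, which is the sensitivity a validity function is meant to resolve.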