Abstract: Functional land use maps are used for land evaluation, environmental analysis, and resource conservation. Spatial data clustering identifies sparse and crowded regions and thereby reveals the overall distribution pattern of a dataset. Some clustering methods take an attribute-oriented approach to knowledge discovery; others rely on natural notions of similarity (e.g., Euclidean distance). Neither is appropriate for constructing functional areas. We propose a similarity value that evaluates the closeness of a pair of points based on the total functional area and the proportion of the main land use type across the entire functional area. We develop constrained attributes that employ this similarity value together with a DT (Delaunay triangulation) criterion function when merging clusters. Four thresholds are set to ensure that functional areas have acceptable proportions, regular shapes, and no overlap. An experimental study was conducted with cadastral data for Chengdu, China, from 2009. The results demonstrate the objectivity and efficiency of the proposed algorithm in defining functional areas. The areas are created dynamically at any convenient time.
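The abstract only names the ingredients of the merging step, not its exact form. The sketch below is a minimal illustration of one way such a Delaunay-constrained merge could look: the similarity formula, threshold value, and function names are assumptions, not the paper's definitions.

```python
# Hypothetical sketch: greedy merging of land-use parcels along Delaunay edges.
# The similarity measure and the single threshold are illustrative stand-ins
# for the paper's similarity value and its four thresholds.
import numpy as np
from scipy.spatial import Delaunay

def delaunay_edges(points):
    """Return the undirected edges of the Delaunay triangulation of 2-D points."""
    tri = Delaunay(points)
    edges = set()
    for simplex in tri.simplices:
        for i in range(3):
            a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
            edges.add((a, b))
    return edges

def merge_clusters(points, areas, main_use_share, sim_threshold=0.6):
    """Merge parcels connected by a Delaunay edge when their similarity is high.

    areas[i]          -- parcel area
    main_use_share[i] -- proportion of the dominant land-use type in parcel i
    The edge similarity is a placeholder: the product of the two parcels'
    dominant-use proportions, discounted by their area imbalance.
    """
    labels = np.arange(len(points))            # each parcel starts as its own cluster
    for a, b in delaunay_edges(points):
        balance = min(areas[a], areas[b]) / max(areas[a], areas[b])
        sim = main_use_share[a] * main_use_share[b] * balance
        if sim >= sim_threshold:
            labels[labels == labels[b]] = labels[a]   # union the two clusters
    return labels
```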
Abstract: Purpose – The aim of this study is to propose a deep neural network (DNN) method that uses side information to improve clustering results for big datasets; the authors also show that applying this information improves clustering performance and speeds up the convergence of network training. Design/methodology/approach – In data mining, semisupervised learning is an attractive approach because good performance can be achieved with a small subset of labeled data; one reason is that data labeling is expensive, and semisupervised learning does not need all labels. One type of semisupervised learning is constrained clustering; this type of learning does not use class labels for clustering. Instead, it uses information about some pairs of instances (side information), which may be in the same cluster (must-link [ML]) or in different clusters (cannot-link [CL]). Constrained clustering has been studied extensively; however, few works have focused on constrained clustering for big datasets. In this paper, the authors present a constrained clustering method for big datasets that uses a DNN. The constraints (ML and CL) are injected into this DNN to improve clustering performance, and the method is called constrained deep embedded clustering (CDEC). An autoencoder is implemented to extract informative low-dimensional features in the latent space, and the encoder network is then retrained using a proposed Kullback-Leibler divergence objective function that captures the constraints in order to cluster the projected samples. The proposed CDEC was compared with an adversarial autoencoder, constrained 1-spectral clustering, and an autoencoder combined with k-means on the well-known MNIST, Reuters-10k, and USPS datasets, with performance assessed in terms of clustering accuracy. Empirical results confirmed the statistical superiority of CDEC over these counterparts in clustering accuracy. Findings – First, this is the first DNN-based constrained clustering method that uses side information to improve clustering, without using labels, on big, high-dimensional datasets. Second, the authors define a formula for injecting side information into the DNN. Third, the proposed method improves clustering performance and network convergence speed. Originality/value – Few works have focused on constrained clustering for big datasets; likewise, studies of DNNs for clustering with a loss function that simultaneously extracts features and clusters the data are rare. The method improves the performance of big data clustering without using labels, which matters because data labeling is expensive and time-consuming, especially for big datasets.
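The abstract describes a DEC-style Kullback-Leibler clustering objective with ML/CL constraints injected into it, but not the exact injection formula. The sketch below shows one plausible way to combine the standard DEC loss with pairwise penalties; the penalty terms and weighting are assumptions, not CDEC's published formulation.

```python
# Illustrative sketch of a DEC-style clustering loss with ML/CL penalties.
import torch
import torch.nn.functional as F

def soft_assignments(z, centers, alpha=1.0):
    """Student's t soft assignment of embeddings z to cluster centers (as in DEC)."""
    dist2 = torch.cdist(z, centers) ** 2
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened target distribution p used by the KL clustering objective."""
    weight = q ** 2 / q.sum(dim=0)
    return weight / weight.sum(dim=1, keepdim=True)

def constrained_loss(z, centers, ml_pairs, cl_pairs, lam=0.1):
    """KL(p || q) plus penalties pulling ML pairs together and pushing CL pairs apart."""
    q = soft_assignments(z, centers)
    p = target_distribution(q).detach()
    kl = F.kl_div(q.log(), p, reduction="batchmean")
    ml = sum(((z[i] - z[j]) ** 2).sum() for i, j in ml_pairs)
    cl = sum(torch.relu(1.0 - ((z[i] - z[j]) ** 2).sum()) for i, j in cl_pairs)
    return kl + lam * (ml + cl)
```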
Abstract: Purpose – Constrained clustering is an important recent development in the clustering literature. The goal of an algorithm in constrained clustering research is to improve the quality of clustering by making use of background knowledge. The purpose of this paper is to suggest a new perspective for constrained clustering: finding an effective transformation of the data into target space with reference to background knowledge given in the form of pairwise must-link and cannot-link constraints. Design/methodology/approach – Most existing methods in constrained clustering are limited to learning a distance metric or kernel matrix from the background knowledge when seeking a transformation of the data into target space. Unlike previous efforts, the author presents a non-linear method for constrained clustering whose basic idea is to use a different non-linear function for each dimension of the target space. Findings – The outcome of the paper is a novel non-linear method for constrained clustering that uses different non-linear functions for each dimension of the target space. The proposed method is formulated and explained for the particular case of quadratic functions. To reduce the number of optimization parameters, the method is modified to relax the quadratic function and approximate it by a factorized version that is easier to solve. Experimental results on synthetic and real-world data demonstrate the efficacy of the proposed method. Originality/value – This study proposes a new direction for the problem of constrained clustering by learning a non-linear transformation of the data into target space without using kernel functions. This work will help researchers develop new methods based on the proposed framework, potentially opening up new research topics.
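To make the per-dimension quadratic idea concrete, the sketch below fits an independent quadratic map for every dimension so that must-link pairs end up close and cannot-link pairs far apart in the target space. The parameterisation (a_d*x^2 + b_d*x), the hinge margin, and the optimizer are assumptions for illustration only.

```python
# Sketch: learn a separate quadratic transform per dimension from ML/CL pairs.
import numpy as np
from scipy.optimize import minimize

def transform(X, params):
    """Apply an independent quadratic function a_d*x^2 + b_d*x to every dimension of X."""
    d = X.shape[1]
    a, b = params[:d], params[d:]
    return a * X ** 2 + b * X

def constraint_objective(params, X, ml_pairs, cl_pairs, margin=1.0):
    """Sum of ML distances plus hinged CL violations in the transformed space."""
    T = transform(X, params)
    ml = sum(np.sum((T[i] - T[j]) ** 2) for i, j in ml_pairs)
    cl = sum(max(0.0, margin - np.sum((T[i] - T[j]) ** 2)) for i, j in cl_pairs)
    return ml + cl

def fit_quadratic_map(X, ml_pairs, cl_pairs):
    d = X.shape[1]
    x0 = np.concatenate([np.zeros(d), np.ones(d)])   # start from the identity map
    res = minimize(constraint_objective, x0, args=(X, ml_pairs, cl_pairs))
    return res.x
```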
Funding: This work was supported by the National Natural Science Foundation of China (Nos. 12271098 and 61772005) and the Outstanding Youth Innovation Team Project for Universities of Shandong Province (No. 2020KJN008).
Abstract: Constrained clustering, such as k-means with instance-level Must-Link (ML) and Cannot-Link (CL) auxiliary information as the constraints, has been studied extensively in recent years due to its broad applications in data science and AI. Despite some heuristic approaches, no algorithm has provided a non-trivial approximation ratio for the constrained k-means problem. To address this issue, we propose an algorithm with a provable approximation ratio of O(log k) when only ML constraints are considered. We also empirically evaluate the performance of our algorithm on real-world datasets with artificial ML and disjoint CL constraints. The experimental results show that our algorithm outperforms existing greedy-based heuristic methods in clustering accuracy.
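The abstract does not spell out the algorithm itself. The sketch below shows a common baseline way to satisfy ML constraints exactly, not the paper's O(log k) method: contract each ML-connected component into a single weighted representative point and then run weighted k-means on the representatives.

```python
# Baseline sketch: enforce ML constraints by contracting ML components.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import KMeans

def ml_constrained_kmeans(X, ml_pairs, k):
    """X: (n, d) array; ml_pairs: list of (i, j) must-link index pairs."""
    n = len(X)
    rows = [i for i, _ in ml_pairs]
    cols = [j for _, j in ml_pairs]
    adj = coo_matrix((np.ones(len(ml_pairs)), (rows, cols)), shape=(n, n))
    n_comp, comp = connected_components(adj, directed=False)

    # One weighted representative (centroid) per ML-connected component.
    reps = np.vstack([X[comp == c].mean(axis=0) for c in range(n_comp)])
    weights = np.array([(comp == c).sum() for c in range(n_comp)], dtype=float)

    km = KMeans(n_clusters=k, n_init=10).fit(reps, sample_weight=weights)
    return km.labels_[comp]          # every point inherits its component's cluster
```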
Funding: This work was supported by the National Natural Science Foundation of China (No. 61300164).
Abstract: As a weaker form of supervisory information, pairwise constraints can be exploited to guide data analysis processes such as data clustering. This paper formulates pairwise constraint propagation, which aims to predict a large quantity of unknown constraints from scarce known constraints, as a low-rank matrix recovery (LMR) problem. Although recent advances in transductive learning based on matrix completion can be adopted directly to solve this problem, our work develops a more general low-rank matrix recovery solution for pairwise constraint propagation, which not only completes the unknown entries in the constraint matrix but also removes noise from the data matrix. The problem can be solved effectively using an augmented Lagrange multiplier method. Experimental results on constrained clustering tasks based on the propagated pairwise constraints show that our method obtains more stable results than state-of-the-art algorithms and outperforms them.
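As a rough illustration of the matrix-completion core of this idea, the sketch below recovers a low-rank constraint matrix from a few observed +1 (must-link) and -1 (cannot-link) entries by iterative singular-value thresholding. This is only an assumption-laden simplification: the paper's full augmented-Lagrange formulation also removes noise from the data matrix, which this sketch omits.

```python
# Sketch: propagate pairwise constraints by low-rank completion of the
# constraint matrix via repeated singular-value shrinkage.
import numpy as np

def propagate_constraints(Z_obs, mask, tau=1.0, n_iter=200):
    """Z_obs: n x n matrix with observed +/-1 entries; mask: 1 where an entry is observed."""
    Z = np.zeros_like(Z_obs, dtype=float)
    for _ in range(n_iter):
        # Shrink singular values to keep the estimate low-rank.
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        Z = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
        # Re-impose the known constraint entries.
        Z[mask == 1] = Z_obs[mask == 1]
    return Z
```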