Funding: This work was supported by the National Natural Science Foundation of China (Nos. 12271098 and 61772005) and the Outstanding Youth Innovation Team Project for Universities of Shandong Province (No. 2020KJN008).
Abstract: Constrained clustering, such as k-means with instance-level Must-Link (ML) and Cannot-Link (CL) auxiliary information as constraints, has been studied extensively in recent years because of its broad applications in data science and AI. Apart from some heuristic approaches, no algorithm has been known to provide a non-trivial approximation ratio for the constrained k-means problem. To address this issue, we propose an algorithm with a provable approximation ratio of O(log k) when only ML constraints are considered. We also empirically evaluate our algorithm on real-world datasets with artificial ML and disjoint CL constraints. The experimental results show that our algorithm outperforms existing greedy-based heuristic methods in clustering accuracy.
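To make the ML-only setting concrete, here is a minimal sketch of one common way to enforce must-link constraints in k-means. It is not the O(log k) approximation algorithm from the abstract, whose details are not given here; it is a plain Lloyd-style heuristic. The idea is that points joined by ML constraints must share a cluster, so each connected component of the must-link graph can be replaced by its centroid, weighted by the component size. The function names `must_link_components` and `ml_kmeans` and all parameters are hypothetical.

```python
import numpy as np


def must_link_components(n, must_link):
    """Label each of n points with the id of its must-link connected component."""
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in must_link:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Relabel the roots as consecutive ids 0..m-1.
    ids = {}
    return np.array([ids.setdefault(find(i), len(ids)) for i in range(n)])


def ml_kmeans(X, must_link, k, n_iter=50, seed=0):
    """Lloyd-style k-means on must-link components (heuristic sketch only)."""
    X = np.asarray(X, dtype=float)
    comp = must_link_components(len(X), must_link)
    m = int(comp.max()) + 1
    if k > m:
        raise ValueError("k exceeds the number of must-link components")

    # Each component becomes one weighted representative (its centroid).
    weights = np.bincount(comp, minlength=m).astype(float)
    sums = np.zeros((m, X.shape[1]))
    np.add.at(sums, comp, X)
    reps = sums / weights[:, None]

    def nearest(centers):
        d2 = ((reps[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    rng = np.random.default_rng(seed)
    centers = reps[rng.choice(m, size=k, replace=False)]
    for _ in range(n_iter):
        assign = nearest(centers)
        for j in range(k):
            mask = assign == j
            if mask.any():  # leave an empty cluster's center unchanged
                centers[j] = np.average(reps[mask], axis=0, weights=weights[mask])

    # Every point inherits the label of its component, so all ML pairs agree.
    return nearest(centers)[comp], centers
```

Because labels are assigned per component, every must-link pair ends up in the same cluster by construction. Cannot-link constraints are not handled in this sketch.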
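Abstract: Existing cluster ensemble algorithms are almost all unsupervised, so they cannot exploit known information, which reduces the accuracy, robustness, and stability of the ensemble. This work combines semi-supervised learning with cluster ensembles and designs a semi-supervised cluster ensemble model to overcome these shortcomings. The main contributions are as follows. First, a semi-supervised cluster ensemble (SCE) model based on Bayesian networks is designed, and inference for the model is carried out with variational methods. Second, on this basis, a concrete algorithm is given within the EM (expectation maximization) framework. Third, experiments are run on selected datasets from the UCI (University of California, Irvine) machine learning repository. The experimental results show that both the SCE model itself and the EM algorithm derived from its variational inference can perform semi-supervised cluster ensembling and that, overall, they outperform algorithms such as NMFS (a nonnegative-matrix-factorization based semi-supervised algorithm), semi-supervised SVM (support vector machine), and LVCE (latent variable model for cluster ensemble). The model combines the advantages of semi-supervised learning and cluster ensembles, and its final clustering results are better than those of semi-supervised clustering alone or cluster ensembles alone.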
Abstract: Purpose: Constrained clustering is an important recent development in the clustering literature. The goal of a constrained clustering algorithm is to improve clustering quality by making use of background knowledge. The purpose of this paper is to suggest a new perspective for constrained clustering by finding an effective transformation of the data into a target space, guided by background knowledge given in the form of pairwise must-link and cannot-link constraints. Design/methodology/approach: Most existing methods in constrained clustering are limited to learning a distance metric or kernel matrix from the background knowledge while looking for a transformation of the data into the target space. Unlike previous efforts, the author presents a non-linear method for constrained clustering whose basic idea is to use a different non-linear function for each dimension of the target space. Findings: The outcome of the paper is a novel non-linear method for constrained clustering that uses a different non-linear function for each dimension of the target space. The method is formulated and explained for the particular case of quadratic functions. To reduce the number of optimization parameters, the method is then modified to relax the quadratic function and approximate it by a factorized version that is easier to solve. Experimental results on synthetic and real-world data demonstrate the efficacy of the proposed method. Originality/value: This study proposes a new direction for the problem of constrained clustering by learning a non-linear transformation of the data into the target space without using kernel functions. This work will help researchers develop new methods based on the proposed framework, which could in turn open up new research topics.
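As a rough illustration of what "a different quadratic function for each target dimension" and "a factorized version" could look like, the following sketch uses hypothetical notation that does not appear in the abstract: x is an input point in d dimensions, y_j is its j-th target coordinate, and A_j, b_j, c_j, u_j, v_j are parameters to be learned. The rank-one factorization shown is only one possible reading and is an assumption, not the paper's stated formulation.

```latex
% Assumed notation: x \in \mathbb{R}^d, target coordinates y_1, \dots, y_m.
% Full quadratic map: O(d^2) free parameters per target dimension.
y_j = f_j(x) = x^\top A_j x + b_j^\top x + c_j,
\qquad A_j \in \mathbb{R}^{d \times d},\; j = 1, \dots, m

% Assumed rank-one relaxation: O(d) free parameters per target dimension.
A_j \approx u_j v_j^\top,\quad u_j, v_j \in \mathbb{R}^d
\;\Longrightarrow\;
y_j \approx (u_j^\top x)(v_j^\top x) + b_j^\top x + c_j
```

Under this reading, the relaxation reduces the quadratic term from d^2 parameters to 2d per target dimension, which matches the abstract's stated motivation of reducing the number of optimization parameters.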