摘要
k-均值问题是理论计算机科学和组合优化领域的经典问题之一.相应的Lloyd算法是数据挖掘的十大经典算法之一,在各种领域被广泛研究和应用,特别是在图像处理和特征工程方面.随着数据多样性和数据量的爆炸性增长,在实际应用中遇到的k-均值聚类问题更加复杂多样,产生了各种亟需解决的具有挑战性的研究课题. k-均值问题在理论上是NP-难的.本文介绍经典k-均值问题及其变形的基于局部搜索、线性规划舍入、原始对偶、对偶拟合和Lagrange松弛等技术的有效算法.首先介绍经典k-均值问题的近似算法、加倍度量空间中的有效多项式时间近似方案及满足稳定性实例的多项式可解性,然后介绍k-均值问题的若干重要变形,包括k-中位、球面k-均值、鲁棒k-均值、带约束的k-均值和隐私保护k-均值等问题,最后列出k-均值领域中的若干公开问题.
The k-means problem is one of the classical problems in theoretical computer science and combinatorial optimization. Meanwhile, the corresponding Lloyd algorithm is one of the ten classical algorithms in data mining. It has been studied in various fields and has a lot of applications, especially on image processing and feature engineering. With the explosive growth of data diversity and quantity, the k-means clustering in practical applications are more complex and diversified. A variety of challenging research topics have emerged that need to be solved urgently. The k-means problem is theoretically NP-hard. In this paper, we introduce effective algorithms based on local search, linear programming rounding, primal-dual, dual-fitting, Lagrange relaxation and other techniques for the classical k-means problem and its variants. We begin with the review of improving approximation algorithms for the classical k-means problem. Then we introduce effective polynomial-time approximation scheme in the doubling metric space and polynomial solvability of stable instances. We further survey several important variants of k-means problems including k-median, spherical k-means, robust k-means, constrained k-means,privacy preserving k-means, etc. Finally, we discuss some open problems for k-means problems.
作者
张冬梅
李敏
徐大川
张真宁
Dongmei Zhang;Min Li;Dachuan Xu;Zhenning Zhang
出处
《中国科学:数学》
CSCD
北大核心
2020年第9期1387-1404,共18页
Scientia Sinica:Mathematica
基金
国家自然科学基金(批准号:11531014和11871081)
山东省高校科研计划(批准号:J17KA171)
山东省自然科学基金(批准号:ZR2019MA032)
北京市教委科技项目(批准号:KM201810005006)资助项目。
关键词
K-均值
近似算法
线性规划
k-means
approximation algorithm
linear programming