Abstract
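Support Vector Machine (SVM) is a statistical learning method built on the principle of structural risk minimization. Owing to its distinctive advantages on nonlinear, small-sample, and high-dimensional problems, it is widely applied in fields such as image recognition, fault diagnosis, and text classification. However, SVM is a supervised learning algorithm that aims to train the learner with a large number of samples carrying unique and unambiguous ground-truth labels, so it struggles to perform well in weakly supervised scenarios such as incomplete supervision, inexact supervision, and polysemous supervision. This paper first explains the concept of weakly supervised scenarios and the relevant theory of SVM, and then, from the perspective of weakly supervised scenarios, systematically reviews the current state and development of research on SVM algorithms, covering methods based on semi-supervised learning, multiple instance learning, and multi-label learning. Methods based on semi-supervised learning are subdivided by data assumption into clustering-assumption and manifold-assumption methods; methods based on multiple instance learning are subdivided by solution strategy into instance-level-space, bag-level-space, and embedded-space methods; and methods based on multi-label learning are subdivided by processing approach into problem-transformation and algorithm-adaptation methods. The paper then summarizes the experimental results of some representative algorithms on public datasets, and finally discusses possible future research directions.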
Support Vector Machine (SVM) is a statistical learning method based on the principle of structural risk minimization. It provides an intuitive geometric interpretation and a rigorous mathematical derivation, and shows unique advantages in handling nonlinear, small-sample, and high-dimensional problems. SVM has garnered significant attention and has been widely applied in fields such as image recognition, fault diagnosis, and text classification. SVM is a classical supervised machine learning algorithm designed to train the learner using samples with complete, unique, and unambiguous ground-truth labels to ensure generalization ability. However, as real-world application tasks become increasingly complex, creating such a sample set is laborious and difficult. On the one hand, it requires a significant amount of time and cost for data collection, cleaning, and debugging. In specific domains, especially the medical field, experts often need to draw on domain knowledge to process and label the samples. On the other hand, learning tasks in the real world often change and evolve. For example, data annotation criteria, annotation granularity, or downstream use cases may change frequently, requiring samples to be re-labeled. Consequently, because of the high cost of sample labeling, a large number of samples in real-world applications lack complete and unambiguous labels. Moreover, samples in most practical task scenarios may be polysemous, that is, a sample can be associated with multiple labels at the same time. Therefore, standard SVM struggles to achieve satisfactory performance in weakly supervised scenarios such as incomplete supervision, inexact supervision, and polysemous supervision. Weakly supervised scenarios are contrasted with supervised scenarios. Unlike the latter, learning algorithms in weakly supervised scenarios are designed to train the learner using samples that may be limited, ambiguous, or only roughly labeled. From the perspective of weakly supervised scenarios, this survey systematically reviews the current research status and development of SVM algorithms. Firstly, the concept of weakly supervised scenarios and the basic mathematical principles of SVM are briefly introduced. Secondly, the existing SVM algorithms in weakly supervised scenarios are divided into three categories according to their learning paradigms, namely semi-supervised learning based methods, multiple instance learning based methods, and multi-label learning based methods. Specifically, the semi-supervised learning based methods can be further subdivided into clustering assumption based approaches and manifold assumption based approaches according to data assumptions. The multiple instance learning based methods can be further classified into instance level based approaches, bag level based approaches, and embedded space based approaches according to problem solutions. The multi-label learning based methods can be further refined into problem transformation based approaches and algorithm adaptation based approaches according to processing ideas. This survey provides a detailed introduction to the representative methods within these categories and summarizes and analyzes their characteristics and shortcomings, offering a basis for selecting different SVM methods in various task scenarios. After that, the performance of some representative algorithms is evaluated and analyzed through experiments on publicly available datasets. Finally, potential research directions for the future development of SVM algorithms in weakly supervised scenarios are discussed, such as data imbalance, weakly supervised regression, mixed weakly supervised learning, large-scale deep-level tasks, and learning problems in open environments.
Authors
丁世飞
孙玉婷
梁志贞
郭丽丽
张健
徐晓
DING Shi-Fei; SUN Yu-Ting; LIANG Zhi-Zhen; GUO Li-Li; ZHANG Jian; XU Xiao (School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116; Mine Digitization Engineering Research Center of the Ministry of Education (China University of Mining and Technology), Xuzhou, Jiangsu 221116)
Source
《计算机学报》
EI
CAS
CSCD
Peking University Core Journal
2024, No. 5, pp. 987-1009 (23 pages)
Chinese Journal of Computers
Funding
Supported by the National Natural Science Foundation of China (62276265, 61976216, 62206297, 62206296).
Keywords
weakly supervised scenarios (弱监督场景)
support vector machine (SVM) (支持向量机)
semi-supervised learning (半监督学习)
multiple instance learning (多示例学习)
multi-label learning (多标记学习)