Abstract
Multi-instance learning (MIL) operates on bags that each contain several instances; bags carry labels, while instances are usually unlabeled. The main task of MIL is to learn the feature information of labeled bags in order to train a classifier. Embedding-based MIL methods select representative instances and use them to embed each bag into a new feature space. However, most existing algorithms struggle to adapt to diverse data distributions, and a single-perspective embedding may yield vectors with weak feature values in the new feature space. In this paper, we propose ADTE, an MIL algorithm that combines adaptive density distribution mining with a tri-perspective embedding ensemble. It consists of three key techniques. (1) The adaptive density distribution instance selection technique mines the density distribution of the negative instance space and clusters dense, connected core instances into clusters of arbitrary shape, yielding a set of negative representative instances; a set of positive representative instances is then obtained by minimizing the similarity between positive and negative instances. (2) The tri-perspective embedding technique mines the positive, negative, and overall feature information of each bag and converts the bag into a single vector under each of the three perspectives. (3) The ensemble technique trains three single-instance classifiers, one on the vectors from each perspective, and integrates them through hard voting to obtain the final MIL model. In the experiments, we used 30 datasets from four domains and compared ADTE with seven state-of-the-art MIL algorithms. The results show that ADTE achieves a higher average accuracy across the datasets than the other algorithms, with particularly good results on the text classification and web recommendation datasets.
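As a rough illustration of techniques (2) and (3), the following Python sketch embeds each bag under three perspectives and combines three classifiers by hard voting. It is not the authors' implementation. The representative instances are taken as given (the density-based selection step is not shown), the min-distance embedding in `embed` is an assumed mapping, and `NearestCentroid` is a hypothetical stand-in for whatever single-instance classifier is used. All function names and the toy data are illustrative.

```python
import math
from collections import Counter

def embed(bag, representatives):
    """Map a bag (list of instance vectors) to one vector: for each
    representative, the distance to the closest instance in the bag.
    Assumption: this min-distance mapping is illustrative only."""
    return [min(math.dist(x, r) for x in bag) for r in representatives]

def tri_perspective(bag, pos_reps, neg_reps):
    """Return the positive, negative, and overall views of a bag."""
    return (embed(bag, pos_reps),
            embed(bag, neg_reps),
            embed(bag, pos_reps + neg_reps))

class NearestCentroid:
    """Hypothetical stand-in for a single-instance classifier."""
    def fit(self, vectors, labels):
        self.centroids = {}
        for label in set(labels):
            rows = [v for v, y in zip(vectors, labels) if y == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, vector):
        return min(self.centroids,
                   key=lambda c: math.dist(vector, self.centroids[c]))

def hard_vote(votes):
    """Majority label among the per-view predictions."""
    return Counter(votes).most_common(1)[0][0]

def fit_views(bags, labels, pos_reps, neg_reps):
    """Train one classifier per perspective."""
    views = [tri_perspective(b, pos_reps, neg_reps) for b in bags]
    return [NearestCentroid().fit([v[i] for v in views], labels)
            for i in range(3)]

def predict_bag(models, bag, pos_reps, neg_reps):
    """Predict a bag label by hard voting over the three views."""
    views = tri_perspective(bag, pos_reps, neg_reps)
    return hard_vote([m.predict(v) for m, v in zip(models, views)])

# Toy data (made up): one positive and one negative representative.
pos_reps = [[5.0, 5.0]]
neg_reps = [[0.0, 0.0]]
bags = [
    [[5.0, 5.0], [0.0, 1.0]],   # positive bag
    [[4.5, 5.0], [1.0, 0.0]],   # positive bag
    [[0.0, 0.0], [1.0, 1.0]],   # negative bag
    [[0.5, 0.0], [0.0, 1.0]],   # negative bag
]
labels = [1, 1, 0, 0]
models = fit_views(bags, labels, pos_reps, neg_reps)
print(predict_bag(models, [[5.1, 4.9], [0.2, 0.1]], pos_reps, neg_reps))
```

In this toy run the negative view alone misclassifies the test bag, because the bag also contains an instance close to the negative representative; the positive and overall views outvote it. That is the kind of case an ensemble over several perspectives is meant to handle.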
Authors
CHEN Tianlin (陈天霖)
YANG Mei (杨梅)
MIN Fan (闵帆)
FANG Yu (方宇)
Affiliations: School of Computer Science, Southwest Petroleum University, Chengdu 610500, China; Institute for Artificial Intelligence, Southwest Petroleum University, Chengdu 610500, China; Lab of Machine Learning, Southwest Petroleum University, Chengdu 610500, China
Source
Journal of Kunming University of Science and Technology (Natural Science) (《昆明理工大学学报(自然科学版)》)
Peking University Core Journal (北大核心)
2023, Issue 6, pp. 54-65 (12 pages)
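Funding
National Natural Science Foundation of China (62006200)
Central Government Funds for Guiding Local Science and Technology Development (2021ZYD0003)
Natural Science Foundation of Sichuan Province (2019YJ0314)
Open Project of the Key Laboratory of Oceanographic Big Data Mining and Application of Zhejiang Province (OBDMA202102)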
Keywords
adaptive density
clustering
instance selection
multi-instance learning
tri-perspective embedding