Abstract
Weakly supervised temporal localization models have no frame-level supervisory signal, so they are prone to two problems when recognizing action instances near boundaries. First, the model focuses too much on the most discriminative part of an action and ignores the rest, which leads to under-localization. Second, the frames near an action boundary closely resemble the background and are hard for the model to distinguish, which leads to over-localization. To classify action snippets more effectively and to reduce under- and over-localization of hard boundary samples, we propose a two-stage weakly supervised temporal localization method. In the first stage, we extract RGB and optical flow features from the input video frames and design a hard sample mining strategy that yields a set of hard boundary samples and a set of easy action samples. We also design a prototype generation module that produces a prototype center for each action category, which turns the action classification task of the second stage into a problem of distance to the prototype centers in the embedding space. In the second stage, the hard sample set from the first stage is fed to a prototype matching module to obtain a specific temporal class activation map. Because optical flow features express motion dynamics, they deserve particular attention: we compute the similarity between the hard sample set and the easy action sample set to obtain enhanced optical flow features, which yields more accurate action predictions for hard boundary samples. Finally, to further refine the action labels predicted by the model, we adopt a pseudo-labeling strategy that provides effective frame-level supervision. Experiments on the THUMOS’14 and ActivityNet v1.2 datasets show that the method outperforms existing weakly supervised temporal localization methods.
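The idea of recasting classification as a distance to prototype centers can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a prototype is the mean embedding of the easy action samples of a class, that distance is squared Euclidean, and that scores are normalized with a softmax over classes. The function names `class_prototypes` and `prototype_tcam` are hypothetical.

```python
import numpy as np

def class_prototypes(embeddings, labels, num_classes):
    """Assumed prototype definition: mean embedding of each class.

    embeddings: (N, D) array of easy-sample embeddings.
    labels:     (N,) integer class ids; every class needs at least one sample.
    """
    protos = np.zeros((num_classes, embeddings.shape[1]))
    for c in range(num_classes):
        protos[c] = embeddings[labels == c].mean(axis=0)
    return protos

def prototype_tcam(snippets, protos):
    """Score each snippet against each prototype.

    snippets: (T, D) snippet embeddings, protos: (C, D).
    Returns a (T, C) map whose rows sum to 1 (softmax over classes of the
    negative squared Euclidean distance, an assumed choice of metric).
    """
    diff = snippets[:, None, :] - protos[None, :, :]   # (T, C, D)
    logits = -np.sum(diff ** 2, axis=2)                # (T, C)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```

A snippet that lies on a prototype gets nearly all of its score mass on that class, while a snippet equidistant from two prototypes is scored evenly between them, which is the ambiguity one would expect for hard boundary samples.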
Source
Computer Science and Application (《计算机科学与应用》)
2023, No. 4, pp. 657-671 (15 pages)