Abstract
Video instance segmentation of street scenes is one of the key problems in self-driving research, because it provides a decision-making basis for vehicle environment perception and path planning. Existing methods apply a single receptive field when sampling anchor frames of multiple aspect ratios, which leads to insufficient edge feature extraction, and the high-level layers of the feature pyramid lack detailed spatial position information. To address these problems, we propose AS-VIS, a network based on Anchor frame calibration and Spatial position information compensation for Video Instance Segmentation. First, an anchor frame calibration module is added to the three branches of the prediction head so that each anchor frame is sampled with a receptive field type that matches its aspect ratio, which addresses the insufficient extraction of object edges. Second, a multi-receptive-field subsampling module is designed to fuse the features sampled with several receptive fields, reducing the information lost during down-sampling. Finally, the multi-receptive-field subsampling module is used to embed the activated feature maps of target regions from the lower levels of the feature pyramid into the higher levels, compensating for the detailed spatial position information that the high-level features lack. The street scene video dataset is extracted from the YouTube-VIS benchmark and comprises 329 training videos and 53 validation videos. Quantitative comparison with YolactEdge on detection and segmentation shows that anchor frame calibration improves average precision by 8.63% and 5.09%, respectively; the feature pyramid with spatial position information compensation improves it by 7.76% and 4.75%; and the full AS-VIS network improves it by 9.26% and 6.46%. The proposed method performs instance-level detection, tracking, and segmentation of street scene video sequences simultaneously, and provides an effective theoretical basis for the environment perception of self-driving vehicles.
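The idea of fusing several receptive fields during down-sampling can be illustrated with a small sketch. This is not the AS-VIS module itself: the abstract does not specify its layers, so the window sizes (1, 2, 3), the stride of 2, the fusion by plain averaging, and the function names `window_mean` and `multi_rf_downsample` are all assumptions chosen for illustration. A real network would more likely use learned convolutions for each branch.

```python
def window_mean(fmap, top, left, size):
    """Mean of a size x size window whose corner is (top, left), clipped to the map."""
    h, w = len(fmap), len(fmap[0])
    vals = [fmap[r][c]
            for r in range(max(top, 0), min(top + size, h))
            for c in range(max(left, 0), min(left + size, w))]
    return sum(vals) / len(vals)


def multi_rf_downsample(fmap, sizes=(1, 2, 3)):
    """Stride-2 down-sampling that averages the responses of several window sizes.

    Hypothetical stand-in for a multi-receptive-field subsampling step:
    each window size plays the role of one receptive-field branch.
    """
    h, w = len(fmap), len(fmap[0])
    out = []
    for r in range(0, h, 2):
        row = []
        for c in range(0, w, 2):
            # One value per branch; larger windows are shifted so they cover (r, c).
            branches = [window_mean(fmap, r - (k - 1) // 2, c - (k - 1) // 2, k)
                        for k in sizes]
            # Fuse the branches by simple averaging (an assumption, not the paper's rule).
            row.append(sum(branches) / len(branches))
        out.append(row)
    return out


if __name__ == "__main__":
    feature_map = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    for line in multi_rf_downsample(feature_map):
        print(line)
```

Compared with picking every second pixel (the `sizes=(1,)` case), each output value here also reflects its neighbours, which is the intuition behind losing less information when down-sampling.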
Authors
张印辉
赵崇任
何自芬
杨宏宽
黄滢
ZHANG Yin-hui; ZHAO Chong-ren; HE Zi-fen; YANG Hong-kuan; HUANG Ying (Department of Mechanical and Electrical Engineering, Kunming University of Science and Technology, Kunming, Yunnan 650500, China)
Source
《电子学报》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2024, No. 1, pp. 94-106 (13 pages)
Acta Electronica Sinica
Funding
National Natural Science Foundation of China (No. 62061022, No. 62171206).
Keywords
street scene
video instance segmentation
anchor frame calibration
spatial information compensation
self-driving vehicle