Funding: Basic and Advanced Research Projects of CSTC, Grant/Award Number: cstc2019jcyj-zdxmX0008; Science and Technology Research Program of Chongqing Municipal Education Commission, Grant/Award Numbers: KJQN202100634, KJZDK201900605; National Natural Science Foundation of China, Grant/Award Number: 62006065.
Abstract: Scene perception and trajectory forecasting are two fundamental challenges that are crucial to a safe and reliable autonomous driving (AD) system. However, most existing methods address only one of these two challenges with a single model. To tackle this limitation, this paper proposes spatio-temporal semantics and interaction graph aggregation for multi-agent perception and trajectory forecasting (ST-SIGMA), an efficient end-to-end method that jointly and accurately perceives the AD environment and forecasts the trajectories of the surrounding traffic agents within a unified framework. ST-SIGMA adopts a trident encoder-decoder architecture to learn scene semantics and agent interaction information on bird's-eye view (BEV) maps simultaneously. Specifically, an iterative aggregation network is first employed as the scene semantic encoder (SSE) to learn diverse scene information. To preserve the dynamic interactions of traffic agents, ST-SIGMA further exploits a spatio-temporal graph network as the graph interaction encoder. Meanwhile, a simple yet efficient feature fusion method is designed to fuse semantic and interaction features into a unified feature space, which serves as the input to a novel hierarchical aggregation decoder for downstream prediction tasks. Extensive experiments on the nuScenes dataset demonstrate that the proposed ST-SIGMA achieves significant improvements over state-of-the-art (SOTA) methods in terms of scene perception and trajectory forecasting, respectively. The proposed approach thus outperforms SOTA methods in model generalisation and robustness and is therefore more feasible for deployment in real-world AD scenarios.
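To make the architecture description above concrete, the following is a minimal PyTorch sketch of a trident-style encoder-decoder: a CNN stand-in for the scene semantic encoder on BEV maps, a one-step graph encoder for agent interactions, concatenation-based fusion into a shared feature space, and an upsampling decoder. All module names, tensor shapes and the fusion scheme are illustrative assumptions, not the authors' released implementation.

# Minimal structural sketch in the spirit of the ST-SIGMA description above.
# Module names, shapes and the concatenation-based fusion are assumptions.
import torch
import torch.nn as nn


class SceneSemanticEncoder(nn.Module):
    """Stand-in for the iterative aggregation network (SSE) on BEV maps."""

    def __init__(self, in_ch: int, feat_ch: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, bev):                      # bev: (B, C, H, W)
        return self.net(bev)                     # (B, F, H/4, W/4)


class GraphInteractionEncoder(nn.Module):
    """Stand-in for the spatio-temporal graph encoder over agent states."""

    def __init__(self, state_dim: int, feat_ch: int):
        super().__init__()
        self.node_mlp = nn.Linear(state_dim, feat_ch)

    def forward(self, agent_states, adj):              # (B, N, T, D), (B, N, N)
        h = self.node_mlp(agent_states).mean(dim=2)    # temporal pooling -> (B, N, F)
        h = torch.bmm(adj, h)                          # one graph message-passing step
        return h.mean(dim=1)                           # global interaction feature (B, F)


class STSigmaSketch(nn.Module):
    def __init__(self, in_ch=13, state_dim=6, feat_ch=64, out_ch=8):
        super().__init__()
        self.sse = SceneSemanticEncoder(in_ch, feat_ch)
        self.gie = GraphInteractionEncoder(state_dim, feat_ch)
        # Hierarchical upsampling decoder over the fused feature map.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * feat_ch, feat_ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(feat_ch, out_ch, 4, stride=2, padding=1),
        )

    def forward(self, bev, agent_states, adj):
        scene = self.sse(bev)                                    # (B, F, h, w)
        inter = self.gie(agent_states, adj)                      # (B, F)
        inter = inter[:, :, None, None].expand(-1, -1, *scene.shape[-2:])
        fused = torch.cat([scene, inter], dim=1)                 # unified feature space
        return self.decoder(fused)                               # per-cell prediction maps


if __name__ == "__main__":
    model = STSigmaSketch()
    bev = torch.randn(2, 13, 128, 128)           # BEV occupancy / semantic channels
    agents = torch.randn(2, 10, 5, 6)            # 10 agents, 5 past frames, 6-dim state
    adj = torch.softmax(torch.randn(2, 10, 10), dim=-1)
    print(model(bev, agents, adj).shape)         # torch.Size([2, 8, 128, 128])

In this sketch the interaction feature is simply broadcast over the BEV grid before concatenation; the paper's actual fusion and decoder heads may differ.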
Funding: Supported by the National Natural Science Foundation of China (11971296) and the National Key R&D Program of China (2021YFA1003004).
Abstract: Scene segmentation is widely used in autonomous driving for environmental perception. Semantic scene segmentation has gained considerable attention owing to its rich semantic information. It assigns labels to the pixels in an image, thereby enabling automatic image labeling. Current approaches are based mainly on convolutional neural networks (CNNs); however, they rely on large numbers of labels. Therefore, the use of a small amount of labeled data to achieve semantic segmentation has become increasingly important. In this study, we developed a domain adaptation framework based on optimal transport (OT) and an attention mechanism to address this issue. Specifically, we first generated the output space via a CNN owing to its superior feature representation capability. Second, we utilized OT to achieve a more robust alignment of the source and target domains in the output space, where the OT plan defines an attention mechanism that improves the adaptation of the model. In particular, OT reduced the number of network parameters and made the network more interpretable. Third, to better describe the multiscale properties of the features, we constructed a multiscale segmentation network to perform domain adaptation. Finally, to verify the performance of the proposed method, we compared it with three benchmark methods and four state-of-the-art (SOTA) methods on three scene datasets. The mean intersection-over-union (mIoU) was significantly improved, and visualization results under multiple domain adaptation scenarios also show that the proposed method outperformed existing semantic segmentation methods.
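As a rough illustration of output-space alignment with optimal transport, the sketch below computes an entropic (Sinkhorn) transport plan between per-pixel class probabilities of source and target segmentation outputs and uses it to weight the transport cost, loosely playing the role of the plan-as-attention idea described above. The Sinkhorn routine, the cost choice and the loss form are assumptions for exposition, not the paper's exact formulation.

# Illustrative sketch of output-space alignment with entropic optimal transport.
# The cost choice and the use of the plan as pixel-pair attention are assumptions.
import torch
import torch.nn.functional as F


def sinkhorn_plan(cost, reg=0.1, n_iter=50):
    """Entropic OT plan between two uniform discrete measures (n x m cost matrix)."""
    n, m = cost.shape
    a = torch.full((n,), 1.0 / n, device=cost.device)
    b = torch.full((m,), 1.0 / m, device=cost.device)
    K = torch.exp(-cost / reg)                        # Gibbs kernel
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(n_iter):                           # Sinkhorn-Knopp iterations
        u = a / (K @ v + 1e-8)
        v = b / (K.T @ u + 1e-8)
    return u[:, None] * K * v[None, :]                # transport plan of shape (n, m)


def ot_alignment_loss(src_logits, tgt_logits, reg=0.1):
    """Align source/target segmentation outputs; the plan acts as soft attention."""
    # Flatten per-pixel class probabilities: (B, C, H, W) -> (B*H*W, C)
    c = src_logits.size(1)
    p_src = F.softmax(src_logits, dim=1).permute(0, 2, 3, 1).reshape(-1, c)
    p_tgt = F.softmax(tgt_logits, dim=1).permute(0, 2, 3, 1).reshape(-1, c)
    cost = torch.cdist(p_src, p_tgt, p=2)             # pairwise output-space cost
    plan = sinkhorn_plan(cost.detach(), reg)          # OT plan over pixel pairs
    return (plan * cost).sum()                        # transport cost to minimise


if __name__ == "__main__":
    src = torch.randn(1, 19, 16, 16, requires_grad=True)   # e.g. 19 segmentation classes
    tgt = torch.randn(1, 19, 16, 16, requires_grad=True)
    loss = ot_alignment_loss(src, tgt)
    loss.backward()
    print(float(loss))

In practice this alignment term would be added to the supervised segmentation loss on the source domain and optimised jointly with the multiscale network described above.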