Abstract
The past decade has witnessed the impressive and steady development of single-modal AI technologies in several fields, thanks to the emergence of deep learning. Less studied, however, is multi-modal AI, commonly considered the next generation of AI, which exploits the complementary context concealed in different-modality inputs to improve performance. Humans naturally learn to form a global concept from multiple modalities (i.e., sight, hearing, touch, smell, and taste), even when some are incomplete or missing. Thus, in addition to the two popular modalities (vision and language), other types of data, such as depth, infrared information, and events, are also important for multi-modal learning in real-world scenes.