Funding: Supported by the National Key Research and Development Program of China (No. 2020YFB1406800).
Abstract: The widespread adoption of the mobile Internet and the Internet of Things (IoT) has led to a significant increase in the amount of video data. While video data are increasingly important, language and text remain the primary means of interaction in everyday communication, so text-based cross-modal retrieval has become a crucial requirement in many applications. Most previous text-video retrieval works exploit the implicit knowledge of pre-trained models such as contrastive language-image pre-training (CLIP) to boost retrieval performance. However, implicit knowledge only records the co-occurrence relationships present in the data and cannot help the model understand specific words or scenes. Another type of out-of-domain knowledge, explicit knowledge, which usually takes the form of a knowledge graph, can play an auxiliary role in understanding the content of different modalities. Therefore, we study, for the first time, the application of an external knowledge base in a text-video retrieval model and propose KnowER, a knowledge-enhanced model for efficient text-video retrieval. The knowledge-enhanced model achieves state-of-the-art performance on three widely used text-video retrieval datasets, i.e., MSRVTT, DiDeMo, and MSVD.
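As a rough illustration of the general idea of explicit knowledge enhancement (not KnowER's actual architecture, which the abstract does not detail), the sketch below fuses a query's text features with knowledge-graph embeddings of the entities it mentions and scores videos by cosine similarity. All module names, dimensions, and the fusion scheme are assumptions made purely for the example.

```python
import torch
import torch.nn as nn

class KnowledgeFusedTextEncoder(nn.Module):
    """Illustrative fusion of text features with knowledge-graph entity embeddings."""

    def __init__(self, text_dim: int = 512, kg_dim: int = 200, out_dim: int = 512):
        super().__init__()
        self.kg_proj = nn.Linear(kg_dim, text_dim)    # project KG embeddings into the text space
        self.fuse = nn.Linear(2 * text_dim, out_dim)  # simple concatenation-based fusion

    def forward(self, text_feat: torch.Tensor, entity_embs: torch.Tensor) -> torch.Tensor:
        # text_feat: (B, text_dim) features from a pre-trained text encoder such as CLIP's.
        # entity_embs: (B, E, kg_dim) embeddings of the entities linked in each query.
        kg_feat = self.kg_proj(entity_embs).mean(dim=1)             # pool entity knowledge per query
        fused = self.fuse(torch.cat([text_feat, kg_feat], dim=-1))
        return nn.functional.normalize(fused, dim=-1)

# Dummy usage: 4 queries with 3 linked entities each, matched against 10 videos.
encoder = KnowledgeFusedTextEncoder()
text_feat = torch.randn(4, 512)
entity_embs = torch.randn(4, 3, 200)
video_feat = nn.functional.normalize(torch.randn(10, 512), dim=-1)
scores = encoder(text_feat, entity_embs) @ video_feat.T            # (4, 10) similarity matrix
print(scores.shape)
```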
Abstract: Video-text retrieval (VTR) is an essential task in multimodal learning that aims to bridge the semantic gap between visual and textual data. Effective video frame sampling plays a crucial role in retrieval performance, as it determines the quality of the visual content representation. Traditional sampling methods, such as uniform sampling and optical flow-based techniques, often fail to capture the full semantic range of a video, leading to redundancy and inefficiency. In this work, we propose CLIP4Video-Sampling, a global semantics-guided multi-granularity frame sampling strategy for video-text retrieval designed to optimize both computational efficiency and retrieval accuracy. By integrating multi-scale global and local temporal sampling and leveraging the feature extraction capabilities of the CLIP (contrastive language-image pre-training) model, our method significantly outperforms existing approaches in both zero-shot and fine-tuned video-text retrieval on popular datasets. CLIP4Video-Sampling reduces redundancy, ensures keyframe coverage, and serves as an adaptable pre-processing module for multimodal models.
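The following is a minimal sketch of global-semantics-guided keyframe selection in the spirit the abstract describes, not the paper's released code: the dense candidate sampling step, the 0.5 diversity weight, and the feature dimensions are assumptions, and dummy tensors stand in for CLIP image features. It scores candidate frames against a global video embedding and greedily keeps a diverse, semantically representative subset.

```python
import torch

def select_keyframes(frame_feats: torch.Tensor, k: int = 8) -> list[int]:
    """frame_feats: (N, D) L2-normalised per-frame features; returns k frame indices."""
    # Global semantics: the mean of all candidate-frame features, re-normalised.
    global_feat = torch.nn.functional.normalize(frame_feats.mean(dim=0), dim=0)
    global_sim = frame_feats @ global_feat                     # (N,) relevance to the whole video
    selected: list[int] = []
    for _ in range(k):
        if selected:
            # Redundancy: maximum similarity to any frame already selected.
            redundancy = (frame_feats @ frame_feats[selected].T).max(dim=1).values
        else:
            redundancy = torch.zeros_like(global_sim)
        score = global_sim - 0.5 * redundancy                  # trade off coverage vs. diversity
        score[selected] = float("-inf")                        # never pick the same frame twice
        selected.append(int(score.argmax()))
    return sorted(selected)

# Dummy usage: 64 uniformly sampled candidate frames with 512-d features.
feats = torch.nn.functional.normalize(torch.randn(64, 512), dim=1)
print(select_keyframes(feats, k=8))
```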