Funding: Supported by the National Key Research and Development Program of China under Grant No. 2020AAA0106200 and the National Natural Science Foundation of China under Grant No. 61832016.
Abstract: In this paper, we present Emotion-Aware Music Driven Movie Montage, a novel paradigm for the challenging task of generating movie montages. Specifically, given a movie and a piece of music as guidance, our method aims to generate a montage from the movie that is emotionally consistent with the music. Unlike previous work such as video summarization, this task requires not only video content understanding but also emotion analysis of both the input movie and the music. To this end, we propose a two-stage framework consisting of a learning-based module that predicts emotion similarity and an optimization-based module that selects and composes candidate movie shots. The core of our method is to align music clips and movie shots in a multi-modal latent space via contrastive learning and to estimate their emotional similarity in that space. Montage generation is then modeled as a joint optimization of emotion similarity and additional constraints such as scene-level story completeness and shot-level rhythm synchronization. Qualitative and quantitative evaluations demonstrate that our method generates emotionally consistent montages and outperforms alternative baselines.
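To make the alignment stage concrete, below is a minimal PyTorch sketch of contrastive alignment between music-clip and movie-shot embeddings in a shared latent space. The projection heads, feature dimensions, and temperature are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionAlignment(nn.Module):
    """Sketch: project both modalities into a shared emotion latent space."""
    def __init__(self, music_dim=128, shot_dim=512, embed_dim=256):
        super().__init__()
        self.music_proj = nn.Linear(music_dim, embed_dim)
        self.shot_proj = nn.Linear(shot_dim, embed_dim)

    def forward(self, music_feat, shot_feat):
        m = F.normalize(self.music_proj(music_feat), dim=-1)
        s = F.normalize(self.shot_proj(shot_feat), dim=-1)
        return m, s

def contrastive_loss(m, s, temperature=0.07):
    """InfoNCE-style loss: matched music-clip/shot pairs are pulled together,
    mismatched pairs within the batch are pushed apart."""
    logits = m @ s.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(m.size(0), device=m.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

At inference, the cosine similarity between the two normalized embeddings would serve as the emotion-similarity score consumed by the optimization stage.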
Funding: This work was supported by the National Natural Science Foundation of China under Grant Nos. 61672520, 61573348, 61620106003, and 61720106006, the Beijing Natural Science Foundation of China under Grant No. 4162056, the National Key Technology Research and Development Program of China under Grant No. 2015BAH53F02, and the CASIA-Tencent YouTu Joint Research Project. The Titan X used for this research was donated by the NVIDIA Corporation.
Abstract: This study introduces a novel conditional recycle generative adversarial network for facial attribute transformation, which can transform high-level semantic face attributes without changing the identity. In our approach, we feed a source facial image to the conditional generator with a target attribute condition to generate a face with the target attribute. We then recycle the generated face back through the same conditional generator with the source attribute condition, producing a face that should match the source face in both personal identity and facial attributes. Hence, we introduce a recycle reconstruction loss that enforces the final generated facial image to be identical to the source facial image. Evaluations on the CelebA dataset demonstrate the effectiveness of our approach. Qualitative results show that our approach can learn and generate high-quality identity-preserving facial images with specified attributes.
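The recycle step can be illustrated with a short sketch. Here G is a placeholder conditional generator; the actual network, attribute encoding, and loss weighting are assumptions for illustration only.

```python
import torch.nn.functional as F

def recycle_reconstruction_loss(G, x_src, attr_src, attr_tgt):
    # First pass: translate the source face to the target attribute.
    x_fake = G(x_src, attr_tgt)
    # Recycle: translate the generated face back to the source attribute.
    x_rec = G(x_fake, attr_src)
    # The recycled face should match the original in identity and attributes.
    return F.l1_loss(x_rec, x_src)
```

This term would be combined with the usual adversarial and attribute-classification losses during training.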
Funding: Supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 60872120, 60902078, and 61172104, the Natural Science Foundation of Beijing under Grant No. 4112061, the Scientific Research Foundation for the Returned Overseas Chinese Scholars of the State Education Ministry of China, the French System@tic Paris-Region (CSDL Project), and the French National Research Agency (ANR)-NSFC Program under Grant No. 60911130368.
Abstract: Current multi-operator image resizing methods succeed in generating impressive results by using an image similarity measure to guide the resizing process, finding an optimal operation path in the resizing space. However, their slow resizing speed, caused by the inefficient computation strategy of bidirectional patch matching, is a drawback in practical use. In this paper, we present a novel method to address this problem. By combining seam carving with scaling and cropping, our method realizes content-aware image resizing very fast. We define cost functions combining image energy and a dominant color descriptor for all the operators to evaluate the damage to both local image content and global visual effect. Our algorithm can therefore automatically find an optimal sequence of operations to resize the image using dynamic programming or a greedy algorithm. We also extend our algorithm to indirect image resizing, which can protect the aspect ratio of the dominant object in an image.
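A minimal sketch of the greedy variant mentioned in the abstract is given below. The operator implementations and the cost function combining image energy with the dominant color descriptor are left abstract; they are placeholders, not the paper's exact formulation.

```python
def greedy_resize(image, target_width, operators, cost_fn):
    """operators: dict name -> callable(image) returning a slightly narrower image.
    cost_fn(original, candidate): damage score for local content and global
    visual effect (lower is better). Image arrays are assumed (H, W, C)."""
    current = image
    sequence = []
    while current.shape[1] > target_width:
        # Apply every operator to the current image and keep the least damaging one.
        candidates = {name: op(current) for name, op in operators.items()}
        name, current = min(candidates.items(),
                            key=lambda kv: cost_fn(image, kv[1]))
        sequence.append(name)
    return current, sequence
```

The dynamic-programming variant would instead tabulate the best cost for every combination of operator counts rather than committing greedily at each step.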
Funding: Supported by the National Key Research and Development Program of China under Grant No. 2020AAA0106200, the National Natural Science Foundation of China under Grant Nos. 62102162, 61832016, U20B2070, and 6210070958, the CASIA-Tencent YouTu Joint Research Project, and the Open Projects Program of the National Laboratory of Pattern Recognition.
Abstract: Vision Transformer has shown impressive performance on image classification tasks. Observing that most existing visual style transfer (VST) algorithms are based on texture-biased convolutional neural networks (CNNs), we ask whether the shape-biased Vision Transformer can perform style transfer as CNNs do. In this work, we focus on comparing and analyzing the shape bias of CNN-based and transformer-based models from the perspective of VST tasks. For comprehensive comparisons, we propose three kinds of transformer-based visual style transfer (Tr-VST) methods: Tr-NST for optimization-based VST, Tr-WCT for reconstruction-based VST, and Tr-AdaIN for perceptual-based VST. By adapting three mainstream VST methods to the transformer pipeline, we show that transformer-based models pre-trained on ImageNet are not well suited to style transfer. Due to the strong shape bias of these models, the Tr-VST methods cannot render style patterns. We further analyze the shape bias by considering the influence of the learned parameters and the structure design. The results show that, with proper style supervision, the transformer can learn texture-biased features similar to those of a CNN. With the reduced shape bias in the transformer encoder, Tr-VST methods can generate higher-quality results compared with state-of-the-art VST methods.
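For reference, the AdaIN operation underlying the Tr-AdaIN variant can be sketched as follows. The (B, C, H, W) feature layout is an assumption about how encoder features (e.g., transformer tokens reshaped to a spatial grid) would be arranged; the surrounding encoder and decoder are not shown.

```python
import torch

def adaptive_instance_norm(content, style, eps=1e-5):
    """Re-normalize content features to carry the channel-wise mean and
    standard deviation of the style features (standard AdaIN)."""
    # Per-channel statistics over the spatial dimensions.
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    # Whiten the content statistics, then re-color with the style statistics.
    return s_std * (content - c_mean) / c_std + s_mean
```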