Unsupervised image translation (UIT) studies the mapping between two image domains. Since such mappings are under-constrained, existing research has pursued various desirable properties such as distributional matching or two-way consistency. In this paper, we re-examine UIT from a new perspective: distributional semantics consistency, based on the observation that data variations carry semantics, e.g., shoes varying in color. Further, the semantics can be multi-dimensional, e.g., shoes also varying in style, functionality, etc. Given two image domains, matching these semantic dimensions during UIT produces mappings with explicable correspondences, which has not been investigated previously. We propose distributional semantics mapping (DSM), the first UIT method that explicitly matches semantics between two domains. We show that distributional semantics has rarely been considered within and beyond UIT, even though it is a common problem in deep learning. We evaluate DSM on several benchmark datasets, demonstrating its general ability to capture distributional semantics. Extensive comparisons show that DSM not only produces explicable mappings but also improves image quality in general.
In this paper, we present an efficient algorithm that generates lip-synchronized facial animation from a given vocal audio clip. By combining a spectral-dimensional bidirectional long short-term memory network with a temporal attention mechanism, we design a lightweight speech encoder that learns useful and robust vocal features from the input audio without resorting to pre-trained speech recognition modules or large training data. To learn subject-independent facial motion, we use deformation gradients as the internal representation, which allows nuanced local motions to be synthesized better than with vertex offsets. Compared with state-of-the-art automatic-speech-recognition-based methods, our model is much smaller but achieves similar robustness and quality most of the time, and noticeably better results in certain challenging cases.
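The abstract above does not give implementation details of the speech encoder, but the temporal-attention step it mentions can be illustrated in isolation. The following is a minimal sketch, not the authors' implementation: it pools per-frame features (e.g., the outputs of a bidirectional LSTM over a spectrogram) into a single vector via softmax attention over time. The function name, the single-vector scoring parameterization `w`, and all dimensions are hypothetical.

```python
import numpy as np

def temporal_attention_pool(frames, w):
    """Collapse T per-frame feature vectors (T x D) into one D-dim vector.

    frames: per-frame features, e.g., BiLSTM outputs over audio frames
    w:      learned scoring vector of shape (D,) -- a hypothetical,
            simplest-possible attention parameterization
    Returns (context, weights): the attention-pooled vector and the
    softmax weights over time.
    """
    scores = frames @ w                    # (T,) raw attention logits
    scores = scores - scores.max()         # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()      # softmax over the time axis
    context = weights @ frames             # weighted sum of frames, (D,)
    return context, weights

# Usage with random stand-in features: 50 audio frames, 16-dim each.
rng = np.random.default_rng(0)
H = rng.standard_normal((50, 16))
w = rng.standard_normal(16)
ctx, att = temporal_attention_pool(H, w)
```

In a full encoder, such a pooled context (or frame-wise attention-reweighted features) would feed a decoder predicting the facial motion representation; that part is beyond what the abstract specifies.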
Funding: supported by the National Natural Science Foundation of China (Grant No. 61772462) and the 100 Talents Program of Zhejiang University.