Funding: This work was supported in part by grants from the National Key R&D Program of China (2021YFC3300403), the National Natural Science Foundation of China (62072382), the Yango Charitable Foundation, and the National Science Foundation (OAC-2007661).
Abstract: It remains an interesting and challenging problem to synthesize a vivid and realistic singing face driven by music. In this paper, we present a method for this task that produces natural motions for the lips, facial expression, head pose, and eyes. Because common music audio signals couple mixed information from the human voice and the backing music, we design a decouple-and-fuse strategy to tackle this challenge. We first decompose the input music audio into a human voice stream and a backing music stream. Because the correlation between the two-stream input signals and the dynamics of the facial expressions, head motions, and eye states is implicit and complicated, we model their relationship with an attention scheme, in which the effects of the two streams are fused seamlessly. Furthermore, to improve the expressiveness of the generated results, we decompose head movement generation into speed and direction, and decompose eye state generation into short-term blinking and long-term eye closing, modeling them separately. We have also built a novel dataset, SingingFace, to support training and evaluation of models for this task, including future work on this topic. Extensive experiments and a user study show that our proposed method synthesizes vivid singing faces that are qualitatively and quantitatively better than the prior state of the art.
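The decouple-and-fuse idea above can be illustrated as a per-frame attention weighting over the two audio streams. This is a minimal NumPy sketch, not the paper's implementation: the function name `fuse_streams`, the learned query vector `w_query`, and the feature shapes are all hypothetical stand-ins for the paper's attention scheme.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_streams(voice_feat, music_feat, w_query):
    # voice_feat, music_feat: (T, D) frame-level features of the two streams.
    # w_query: (D,) hypothetical learned query that scores each stream per frame.
    stacked = np.stack([voice_feat, music_feat], axis=1)  # (T, 2, D)
    scores = stacked @ w_query                            # (T, 2) per-stream scores
    weights = softmax(scores, axis=1)                     # per-frame stream weights
    fused = (weights[..., None] * stacked).sum(axis=1)    # (T, D) fused features
    return fused, weights

rng = np.random.default_rng(0)
T, D = 4, 8
fused, w = fuse_streams(rng.normal(size=(T, D)),
                        rng.normal(size=(T, D)),
                        rng.normal(size=D))
```

The softmax guarantees the two stream weights sum to one at every frame, so the fused feature is always a convex combination of the voice and backing-music features.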
Abstract: Using a FACE (Free-Air Carbon-dioxide Enrichment) platform and the bleeding-sap flow method, we studied the effects of elevated atmospheric CO2 concentration on rice root activity and root N assimilation capacity (amino acid synthesis capacity) at the tillering and heading stages and at 35 days after heading, under low nitrogen (LN, 150 kg·hm-2) and normal nitrogen (NN, 250 kg·hm-2) levels. The results showed that, at the whole-plant level, elevated CO2 and N treatment had no significant effect on root activity; however, because tiller number under FACE increased by 14.5% (LN) and 20.7% (NN), root activity per stem (bleeding intensity) decreased by 1.4%-21.7%. At the tillering and heading stages, although the FACE treatment promoted the conversion of root-absorbed inorganic N into amino acids, raising the amino-acid-N/inorganic-N ratio in the bleeding sap by 11.1%-143.1%, neither amino acid concentration nor total amino acid synthesis differed significantly from the control. At 35 days after heading, the FACE treatment weakened root N assimilation: the amino-acid-N/inorganic-N ratio in the bleeding sap fell by 38.1% (LN) and 29.2% (NN), amino acid concentration fell by 34.0% (LN) and 44.7% (NN), and total amino acid synthesis fell by 50.8% (LN) and 40.0% (NN). Raising the N application level promoted root uptake of inorganic N at heading, increasing inorganic N in the bleeding sap by 51.1% (control) and 155.2% (FACE), but did not increase amino acid synthesis, so the amino-acid-N/inorganic-N ratio at heading fell significantly, by 19.5% (control) and 36.8% (FACE); the N treatment also showed a clear interaction with the FACE treatment at this stage.
Funding: This work was supported by the Double First-Class Innovation Research Project for People's Public Security University of China (No. 2023SYL08).
Abstract: Voice portrait technology explores and establishes the relationship between speakers' voices and their facial features, aiming to generate corresponding facial characteristics from the voice of an unknown speaker. Owing to their powerful advantages in image generation, Generative Adversarial Networks (GANs) are now widely applied across various fields. Existing Voice2Face methods for voice portraits are primarily based on GANs trained on voice-face paired datasets. However, voice portrait models built solely on GANs face limitations in image generation quality and struggle to maintain facial similarity. Additionally, the training process is relatively unstable, which affects the overall generative performance of the model. To overcome these challenges, we propose a novel deep Generative Adversarial Network model for audio-visual synthesis, named AVP-GAN (Attention-enhanced Voice Portrait Model using Generative Adversarial Network). This model is based on a convolutional attention mechanism and is capable of generating corresponding facial images from the voice of an unknown speaker. First, to address training instability, we integrate convolutional neural networks with deep GANs. In the network architecture, we apply spectral normalization to constrain the variation of the discriminator, preventing issues such as mode collapse. Second, to enhance the model's ability to extract relevant features between the two modalities, we propose a voice portrait model based on convolutional attention, which learns the mapping between voice and facial features in a common space along the channel and spatial dimensions independently. Third, to enhance the quality of generated faces, we incorporate a degradation removal module and use pretrained facial GANs as facial priors to repair and enhance the clarity of the generated facial images.
Experimental results demonstrate that our AVP-GAN achieves a cosine similarity of 0.511, outperforming our comparison models, and effectively generates high-quality facial images corresponding to a speaker's voice.
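Spectral normalization, mentioned in the abstract above as the stabilizer for the discriminator, rescales each weight matrix so that its largest singular value is about 1, bounding the layer's Lipschitz constant. Below is a minimal NumPy sketch using power iteration; the function name `spectral_normalize` and the iteration count are illustrative choices, not details from the paper (production implementations, e.g. in deep learning frameworks, typically reuse one power-iteration step per training update instead).

```python
import numpy as np

def spectral_normalize(W, n_iter=20):
    # Estimate the largest singular value of W by power iteration,
    # then rescale W so its spectral norm is ~1. This bounds the
    # Lipschitz constant of the layer, stabilizing GAN discriminators.
    u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # Rayleigh-quotient estimate of the top singular value
    return W / sigma

# Toy weight matrix with singular values 3.0, 1.0, 0.5.
W = np.diag([3.0, 1.0, 0.5])
W_sn = spectral_normalize(W)
```

After normalization the largest singular value of `W_sn` is approximately 1, while the ratios between singular values (and hence the direction-dependent behavior of the layer) are preserved.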
Abstract: Although this field was initiated more than 20 years ago, there has been an explosion of scientific interest in computerized recognition of human faces in recent years. In this survey, we give an introductory course on this area. Our focus is on some new techniques, whose advantages and disadvantages are reviewed. This survey also summarizes work in this area done in China. Finally, some conclusions are drawn.