In recent years, deep learning techniques have been used to estimate gaze, a significant task in computer vision and human-computer interaction. Previous studies have made significant achievements in predicting 2D or 3D gazes from monocular face images. This study presents a deep neural network for 2D gaze estimation on mobile devices. It achieves state-of-the-art 2D gaze point regression error, while significantly improving gaze classification error on quadrant divisions of the display. To this end, an efficient attention-based module that correlates and fuses the left and right eye contextual features is first proposed to improve gaze point regression performance. Subsequently, through a unified perspective for gaze estimation, metric learning for gaze classification on quadrant divisions is incorporated as additional supervision. Consequently, both gaze point regression and quadrant classification performances are improved. The experiments demonstrate that the proposed method outperforms existing gaze-estimation methods on the GazeCapture and MPIIFaceGaze datasets.
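The abstract does not spell out the attention-based fusion module; as a rough illustration of one way left- and right-eye contextual features could be correlated and fused, here is a minimal cross-attention sketch in NumPy (the token counts, feature dimension, and `cross_attention` helper are hypothetical, not the paper's architecture):

```python
import numpy as np

def cross_attention(queries, keys_values, d_k):
    # Scaled dot-product attention: queries come from one eye,
    # keys/values from the other, so each eye's features are
    # re-weighted by their correlation with the opposite eye.
    scores = queries @ keys_values.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ keys_values

# Toy contextual features: 4 tokens per eye, 8 channels each.
rng = np.random.default_rng(0)
left = rng.normal(size=(4, 8))
right = rng.normal(size=(4, 8))

# Fuse by attending each eye over the other and concatenating,
# giving a joint feature a gaze-regression head could consume.
fused = np.concatenate(
    [cross_attention(left, right, 8), cross_attention(right, left, 8)],
    axis=-1,
)
print(fused.shape)
```

In a real model the query/key/value projections would be learned; they are omitted here to keep the sketch minimal.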
Light fields are vector functions that map the geometry of light rays to the corresponding plenoptic attributes. They describe the holographic information of scenes by representing the amount of light flowing in every direction through every point in space. The physical concept of light fields was first proposed in 1936, and light fields are becoming increasingly important in the field of computer graphics, especially with the fast growth of computing capacity as well as network bandwidth. In this article, light field imaging is reviewed from the following aspects, with an emphasis on the achievements of the past five years: (1) depth estimation, (2) content editing, (3) image quality, (4) scene reconstruction and view synthesis, and (5) industrial products, since light field technologies also intersect with industrial applications. State-of-the-art research has focused on light field acquisition, manipulation, and display. In addition, the research has extended from the laboratory to industry. Given these achievements and challenges, in the near future, applications of light fields could offer greater portability, accessibility, compatibility, and ability to visualize the world.
In this paper, we tackle the challenging problem of point cloud completion from the perspective of feature learning. Our key observation is that to recover the underlying structures as well as surface details, given partial input, a fundamental component is a good feature representation that can capture both global structure and local geometric details. We accordingly first propose FSNet, a feature structuring module that can adaptively aggregate point-wise features into a 2D structured feature map by learning multiple latent patterns from local regions. We then integrate FSNet into a coarse-to-fine pipeline for point cloud completion. Specifically, a 2D convolutional neural network is adopted to decode feature maps from FSNet into a coarse and complete point cloud. Next, a point cloud upsampling network is used to generate a dense point cloud from the partial input and the coarse intermediate output. To efficiently exploit local structures and enhance point distribution uniformity, we propose IFNet, a point upsampling module with a self-correction mechanism that can progressively refine details of the generated dense point cloud. We have conducted qualitative and quantitative experiments on the ShapeNet, MVP, and KITTI datasets, which demonstrate that our method outperforms state-of-the-art point cloud completion approaches.
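FSNet's exact formulation is not given in the abstract; a common way to aggregate an unordered set of point-wise features into a fixed-size 2D map is soft assignment to learned latent patterns, sketched below (the pattern count, feature dimension, and `structure_features` helper are illustrative assumptions, not the paper's design):

```python
import numpy as np

def structure_features(point_feats, patterns, tau=1.0):
    # Soft-assign each point-wise feature to K latent patterns,
    # then aggregate per pattern, yielding a (K, C) structured map.
    logits = point_feats @ patterns.T / tau            # (N, K) similarities
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                  # soft assignment
    # Each pattern row is a weighted mean of the point features.
    return (w.T @ point_feats) / (w.sum(axis=0)[:, None] + 1e-8)

rng = np.random.default_rng(1)
feats = rng.normal(size=(2048, 64))    # N point-wise features
patterns = rng.normal(size=(16, 64))   # K latent patterns (would be learned)
fmap = structure_features(feats, patterns)
print(fmap.shape)  # a fixed-size 2D map a CNN decoder can consume
```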
Reconstructing dynamic scenes with commodity depth cameras has many applications in computer graphics, computer vision, and robotics. However, due to the presence of noise and erroneous observations from data capturing devices and the inherently ill-posed nature of non-rigid registration with insufficient information, traditional approaches often produce low-quality geometry with holes, bumps, and misalignments. We propose a novel 3D dynamic reconstruction system, named HDR-Net-Fusion, which learns to simultaneously reconstruct and refine the geometry on the fly with a sparse embedded deformation graph of surfels, using a hierarchical deep reinforcement (HDR) network. The latter comprises two parts: a global HDR-Net, which rapidly detects local regions with large geometric errors, and a local HDR-Net, serving as a local patch refinement operator to promptly complete and enhance such regions. Training the global HDR-Net is formulated as a novel reinforcement learning problem to implicitly learn the region selection strategy, with the goal of improving the overall reconstruction quality. The applicability and efficiency of our approach are demonstrated using a large-scale dynamic reconstruction dataset. Our method can reconstruct geometry with higher quality than traditional methods.
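The abstract frames region selection as reinforcement learning without giving details; the toy sketch below conveys only the general idea, a policy that picks high-error regions and is rewarded by the resulting drop in total reconstruction error (the epsilon-greedy policy and all names and numbers are hypothetical stand-ins):

```python
import numpy as np

def select_and_refine(errors, refine_gain=0.5, steps=3, eps=0.1, seed=0):
    # Epsilon-greedy stand-in for a learned region-selection policy:
    # at each step pick a region and "refine" it, with the reward
    # defined as the reduction in overall geometric error.
    rng = np.random.default_rng(seed)
    errors = errors.copy()
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            i = int(rng.integers(len(errors)))  # explore
        else:
            i = int(np.argmax(errors))          # exploit: worst region
        reduction = refine_gain * errors[i]     # local patch refinement
        errors[i] -= reduction
        total_reward += reduction               # reward = error drop
    return errors, total_reward

errs = np.array([0.9, 0.2, 0.5, 0.1])
new_errs, reward = select_and_refine(errs)
print(new_errs.sum() < errs.sum())  # True: every step removes some error
```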
In this paper, we present Emotion-Aware Music Driven Movie Montage, a novel paradigm for the challenging task of generating movie montages. Specifically, given a movie and a piece of music as the guidance, our method aims to generate a montage out of the movie that is emotionally consistent with the music. Unlike previous work such as video summarization, this task requires not only video content understanding, but also emotion analysis of both the input movie and music. To this end, we propose a two-stage framework, including a learning-based module for the prediction of emotion similarity and an optimization-based module for the selection and composition of candidate movie shots. The core of our method is to align and estimate emotional similarity between music clips and movie shots in a multi-modal latent space via contrastive learning. Subsequently, the montage generation is modeled as a joint optimization of emotion similarity and additional constraints such as scene-level story completeness and shot-level rhythm synchronization. We conduct both qualitative and quantitative evaluations to demonstrate that our method can generate emotionally consistent montages and outperforms alternative baselines.
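Contrastive alignment of two modalities in a shared latent space is commonly trained with an InfoNCE-style objective; the sketch below shows such a loss over paired music/shot embeddings (the embeddings, dimensions, and temperature are illustrative assumptions, not the paper's actual setup):

```python
import numpy as np

def info_nce(music_emb, shot_emb, temperature=0.07):
    # Symmetric InfoNCE over paired embeddings: matched music/shot
    # pairs sit on the diagonal of the cosine-similarity matrix.
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = normalize(music_emb) @ normalize(shot_emb).T / temperature

    def xent_diag(logits):
        # Cross-entropy with the diagonal entries as targets.
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent_diag(sim) + xent_diag(sim.T))

rng = np.random.default_rng(2)
music = rng.normal(size=(8, 32))
shots = music + 0.01 * rng.normal(size=(8, 32))  # near-aligned pairs
loss = info_nce(music, shots)
print(float(loss))  # small loss, since the pairs are already aligned
```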
Distinguishing aesthetically pleasing food photos from others is an important visual analysis task for social media and ranking systems related to food. Nevertheless, aesthetic assessment of food images remains a challenging and relatively unexplored task, largely due to the lack of related food image datasets and practical knowledge. Thus, we present the Gourmet Photography Dataset (GPD), the first large-scale dataset for aesthetic assessment of food photos. It contains 24,000 images with corresponding binary aesthetic labels, covering a large variety of foods and scenes. We also provide a non-stationary regularization method to combat over-fitting and enhance the ability of tuned models to generalize. Quantitative results from extensive experiments, including a generalization ability test, verify that neural networks trained on the GPD achieve performance comparable to human experts on the task of aesthetic assessment. We reveal several valuable findings to support further research and applications related to visual aesthetic analysis of food images. To encourage further research, we have made the GPD publicly available at https://github.com/Openning07/GPA.
Underwater robotic operation usually requires visual perception (e.g., object detection and tracking), but underwater scenes have poor visual quality and represent a special domain which can affect the accuracy of visual perception. In addition, detection continuity and stability are important for robotic perception, but the commonly used static-accuracy-based evaluation (i.e., average precision) is insufficient to reflect detector performance across time. In response to these two problems, we present a design for a novel robotic visual perception framework. First, we investigate the relationship between a quality-diverse data domain and visual restoration in detection performance. We find that although domain quality has a negligible effect on within-domain detection accuracy, visual restoration is beneficial to detection in real sea scenarios by reducing the domain shift. Moreover, non-reference assessments are proposed for detection continuity and stability based on object tracklets. Further, online tracklet refinement is developed to improve the temporal performance of detectors. Finally, combined with visual restoration, an accurate and stable underwater robotic visual perception framework is established. Small-overlap suppression is proposed to extend video object detection (VID) methods to a single-object tracking task, providing the flexibility to switch between detection and tracking. Extensive experiments were conducted on the ImageNet VID dataset and real-world robotic tasks to verify the correctness of our analysis and the superiority of our proposed approaches. The code is available at https://github.com/yrqs/VisPerception.
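The paper's small-overlap suppression is not specified here beyond its name; one plausible minimal reading, filtering out detections whose IoU with the tracked target is too small so that a video detector behaves like a single-object tracker, can be sketched as follows (the threshold and data layout are assumptions):

```python
def iou(a, b):
    # Intersection over union of two [x1, y1, x2, y2] boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def small_overlap_suppression(detections, target_box, min_iou=0.3):
    # Keep only detections overlapping the tracked target enough,
    # discarding distractors elsewhere in the frame.
    return [d for d in detections if iou(d["box"], target_box) >= min_iou]

dets = [{"box": [0, 0, 10, 10], "score": 0.9},
        {"box": [50, 50, 60, 60], "score": 0.8}]
kept = small_overlap_suppression(dets, target_box=[1, 1, 11, 11])
print(len(kept))  # 1 -- the far-away detection is suppressed
```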
With the popularization of social media, the way information is transmitted has changed, and the prediction of information popularity on social media platforms has attracted extensive attention. Feature fusion-based media popularity prediction methods focus on the multi-modal features of social media, aiming to explore the key factors affecting media popularity. Meanwhile, these methods make up for the deficiency in feature utilization of traditional methods based on information propagation processes. In this paper, we review feature fusion-based media popularity prediction methods from the perspectives of feature extraction and predictive model construction. Before that, we analyze the influencing factors of media popularity to provide an intuitive understanding. We further discuss the advantages and disadvantages of existing methods and datasets to highlight future directions. Finally, we discuss the applications of popularity prediction. To the best of our knowledge, this is the first survey of feature fusion-based media popularity prediction methods.
Funding: the National Natural Science Foundation of China, No. 61932003, and the Fundamental Research Funds for the Central Universities.
Funding: The last author was supported by the National Key R&D Program of China, No. 2019YFB1405703.
Funding: This work was supported by the National Natural Science Foundation of China (61872250, U2001206, U21B2023), the GD Natural Science Foundation (2021B1515020085), the DEGP Innovation Team (2022KCXTD025), the Shenzhen Science and Technology Innovation Program (JCYJ20210324120213036), and the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ).
Funding: This work was supported by the National Natural Science Foundation of China (Grant Nos. 61902210 and 61521002).
Funding: Supported by the National Key Research and Development Program of China under Grant No. 2020AAA0106200 and the National Natural Science Foundation of China under Grant No. 61832016.
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 61832016 and 61672520, and the CASIA-Tencent Youtu joint research project.
Funding: Project supported by the National Natural Science Foundation of China (Nos. 61633004, 61725305, and 62073196) and the S&T Program of Hebei Province, China (No. F2020203037).
Funding: Supported in part by the National Natural Science Foundation of China (62002257, U21B2024), the Funding Project of the State Key Laboratory of Communication Content Cognition (Grant No. A02106), the Open Funding Project of the State Key Laboratory of Communication Content Cognition (Grant No. 20K04), and the China Postdoctoral Science Foundation (2021M692395).