Video colorization is a challenging and highly ill-posed problem. Although recent years have witnessed remarkable progress in single image colorization, there is relatively little research effort on video colorization, and existing methods always suffer from severe flickering artifacts (temporal inconsistency) or unsatisfactory colorization. We address this problem from a new perspective, by jointly considering colorization and temporal consistency in a unified framework. Specifically, we propose a novel temporally consistent video colorization (TCVC) framework. TCVC effectively propagates frame-level deep features in a bidirectional way to enhance the temporal consistency of colorization. Furthermore, TCVC introduces a self-regularization learning (SRL) scheme to minimize the differences in predictions obtained using different time steps. SRL does not require any ground-truth color videos for training and can further improve temporal consistency. Experiments demonstrate that our method not only provides visually pleasing colorized video, but also achieves clearly better temporal consistency than state-of-the-art methods. A video demo is provided at https://www.youtube.com/watch?v=c7dczMs-olE, and code is available at https://github.com/lyh-18/TCVC-Temporally-Consistent-Video-Colorization.
Funding: supported by grants from the National Natural Science Foundation of China (61906184), the Joint Lab of CAS–HK, and the Shanghai Committee of Science and Technology, China (20DZ1100800, 21DZ1100100).
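A minimal sketch (PyTorch) of the self-regularization idea described above: colorize the same grayscale clip at two temporal strides and penalize disagreement on the frames both passes share, with no ground-truth color involved. The `colorize` model, the stride choice, and the L1 penalty are illustrative assumptions, not the paper's exact SRL formulation.

```python
import torch.nn.functional as F

def self_regularization_loss(colorize, gray_frames):
    # `colorize`: hypothetical model mapping a (T, C, H, W) grayscale clip
    # to per-frame color predictions. Predictions obtained with different
    # time steps should agree on the frames they share.
    pred_dense = colorize(gray_frames)          # stride-1 pass, all frames
    pred_sparse = colorize(gray_frames[::2])    # stride-2 pass, even frames
    return F.l1_loss(pred_dense[::2], pred_sparse)  # penalize disagreement
```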
Visual object tracking has been drawing increasing attention in recent years, as a fundamental task in computer vision. To extend the range of tracking applications, researchers have been introducing information from multiple modalities to handle specific scenes, with promising research prospects for emerging methods and benchmarks. To provide a thorough review of multi-modal tracking, different aspects of multi-modal tracking algorithms are summarized under a unified taxonomy, with specific focus on visible-depth (RGB-D) and visible-thermal (RGB-T) tracking. Subsequently, a detailed description of the related benchmarks and challenges is provided. Extensive experiments were conducted to analyze the effectiveness of trackers on five datasets: PTB, VOT19-RGBD, GTOT, RGBT234, and VOT19-RGBT. Finally, various future directions, including model design and dataset construction, are discussed from different perspectives for further research.
Funding: supported in part by the National Natural Science Foundation of China (Nos. U23A20384 and 62022021), the Joint Fund of the Ministry of Education for Equipment Pre-research (No. 8091B032155), the National Defense Basic Scientific Research Program (No. WDZC20215250205), and the Central Guidance on Local Science and Technology Development Fund of Liaoning Province (No. 2022JH6/100100026).
The explosive growth of social media means portrait editing and retouching are in high demand. While portraits are commonly captured and stored as raster images, editing raster images is non-trivial and requires the user to be highly skilled. Aiming at developing intuitive and easy-to-use portrait editing tools, we propose a novel vectorization method that can automatically convert raster images into a 3-tier hierarchical representation. The base layer consists of a set of sparse diffusion curves (DCs) which characterize salient geometric features and low-frequency colors, providing a means for semantic color transfer and facial expression editing. The middle level encodes specular highlights and shadows as large, editable Poisson regions (PRs) and allows the user to directly adjust illumination by tuning the strength and changing the shapes of PRs. The top level contains two types of pixel-sized PRs for high-frequency residuals and fine details such as pimples and pigmentation. We train a deep generative model that can produce high-frequency residuals automatically. Thanks to the inherent meaning in vector primitives, editing portraits becomes easy and intuitive. In particular, our method supports color transfer, facial expression editing, highlight and shadow editing, and automatic retouching. To quantitatively evaluate the results, we extend the commonly used FLIP metric (which measures color and feature differences between two images) to consider illumination. The new metric, illumination-sensitive FLIP, can effectively capture salient changes in color transfer results, and is more consistent with human perception than FLIP and other quality measures for portrait images. We evaluate our method on the FFHQR dataset and show it to be effective for common portrait editing tasks, such as retouching, light editing, color transfer, and expression editing.
Funding: supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 1 (RG20/20), the National Natural Science Foundation of China (61872347), and the Special Plan for the Development of Distinguished Young Scientists of ISCAS (Y8RC535018).
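One plausible reading of the illumination-sensitive extension, sketched below: blend the standard FLIP error map with a normalized per-pixel luminance difference. Here `flip_error_map` is a hypothetical callable standing in for an existing FLIP implementation and returning the usual error map in [0, 1]; the blend and its weight are assumptions, not the paper's exact definition.

```python
import numpy as np

def illumination_sensitive_flip(img_a, img_b, flip_error_map, w=0.5):
    # img_a, img_b: (H, W, 3) images in [0, 1].
    luma = lambda im: 0.2126 * im[..., 0] + 0.7152 * im[..., 1] + 0.0722 * im[..., 2]
    illum = np.abs(luma(img_a) - luma(img_b))        # illumination difference map
    err = (1 - w) * flip_error_map(img_a, img_b) + w * illum
    return float(err.mean())
```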
Recent studies have indicated that foundation models, such as BERT and GPT, excel at adapting to various downstream tasks. This adaptability has made them a dominant force in building artificial intelligence (AI) systems. Moreover, a new research paradigm has emerged as visualization techniques are incorporated into these models. This study divides these intersections into two research areas: visualization for foundation model (VIS4FM) and foundation model for visualization (FM4VIS). In terms of VIS4FM, we explore the primary role of visualizations in understanding, refining, and evaluating these intricate foundation models. VIS4FM addresses the pressing need for transparency, explainability, fairness, and robustness. Conversely, in terms of FM4VIS, we highlight how foundation models can be used to advance the visualization field itself. The intersection of foundation models with visualizations is promising but also introduces a set of challenges. By highlighting these challenges and promising opportunities, this study aims to provide a starting point for the continued exploration of this research avenue.
Funding: supported by the National Natural Science Foundation of China (Grant Nos. U21A20469 and 61936002), the National Key R&D Program of China (Grant No. 2020YFB2104100), and grants from the Institute Guo Qiang, THUIBCS, and BLBCI.
Robustness and generalization are two challenging problems for learning point cloud representation. To tackle these problems, we first design a novel geometry coding model, which can effectively use an invariant eigengraph to group points with similar geometric information, even when such points are far from each other. We also introduce a large-scale point cloud dataset, PCNet184. It consists of 184 categories and 51,915 synthetic objects, which brings new challenges for point cloud classification, and provides a new benchmark to assess point cloud cross-domain generalization. Finally, we perform extensive experiments on point cloud classification, using ModelNet40, ScanObjectNN, and our PCNet184, and segmentation, using ShapeNetPart and S3DIS. Our method achieves comparable performance to state-of-the-art methods on these datasets, for both supervised and unsupervised learning. Code and our dataset are available at https://github.com/MingyeXu/PCNet184.
Funding: partially supported by the National Natural Science Foundation of China (Grant Nos. 61876176 and U1813218), the Joint Lab of CAS–HK, the Shenzhen Research Program (Grant No. RCJC20200714114557087), the Shanghai Committee of Science and Technology (Grant No. 21DZ1100100), and the Shenzhen Institute of Artificial Intelligence and Robotics for Society.
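To make the grouping idea concrete, here is a hedged sketch of one spectral-geometry grouping scheme: the eigenvalues of each point's local covariance are rotation- and translation-invariant, so distant points with similar local shape (planar, linear, spherical) can land in the same group. This illustrates the principle only and is not the paper's eigengraph model.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import KMeans

def eigen_feature_groups(points, k=16, n_groups=8):
    # points: (N, 3). Describe each point by the normalized eigenvalues of
    # its k-neighborhood covariance, then cluster these invariant features.
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    feats = []
    for nbrs in idx:
        local = points[nbrs] - points[nbrs].mean(axis=0)
        evals = np.linalg.eigvalsh(local.T @ local / k)  # ascending order
        feats.append(evals / (evals.sum() + 1e-9))       # scale-normalized
    return KMeans(n_clusters=n_groups, n_init=10).fit_predict(np.array(feats))
```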
Visual simultaneous localisation and mapping (vSLAM) finds applications for indoor and outdoor navigation that routinely subjects it to visual complexities, particularly mirror reflections. The effect of mirror presence (time visible and its average size in the frame) was hypothesised to impact localisation and mapping performance, with systems using direct techniques expected to perform worse. Thus, a dataset, MirrEnv, of image sequences recorded in mirror environments, was collected and used to evaluate the performance of existing representative methods. RGBD ORB-SLAM3 and BundleFusion appear to show moderate degradation of absolute trajectory error with increasing mirror duration, whilst the remaining results did not show significantly degraded localisation performance. The mesh maps generated proved to be very inaccurate, with real and virtual reflections colliding in the reconstructions. A discussion is given of the likely sources of error and robustness in mirror environments, outlining future directions for validating and improving vSLAM performance in the presence of planar mirrors. The MirrEnv dataset is available at https://doi.org/10.17035/d.2023.0292477898.
Funding: funded by the UK EPSRC through a Doctoral Training Partnership, No. EP/T517951/1 (2435656).
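For reference, the absolute trajectory error used in this comparison can be computed as below: rigidly align the estimated positions to ground truth with the Kabsch/Umeyama closed form, then report the RMSE of the residuals. This is the standard metric definition (no scale correction here), not code from the study.

```python
import numpy as np

def absolute_trajectory_error(gt, est):
    # gt, est: (N, 3) time-synchronized camera positions.
    P = est - est.mean(axis=0)                # centered estimates
    Q = gt - gt.mean(axis=0)                  # centered ground truth
    U, _, Vt = np.linalg.svd(P.T @ Q)         # cross-covariance SVD
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # optimal rotation (Kabsch)
    residual = Q - P @ R.T
    return np.sqrt(np.mean(np.sum(residual ** 2, axis=1)))
```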
Learning and inferring underlying motion patterns of captured 2D scenes and then re-creating dynamic evolution consistent with real-world natural phenomena have high appeal for graphics and animation. To bridge the technical gap between virtual and real environments, we focus on the inverse modeling and reconstruction of visually consistent and property-verifiable oceans, taking advantage of deep learning and differentiable physics to learn geometry and constituent waves in a self-supervised manner. First, we infer hierarchical geometry using two networks, which are optimized via the differentiable renderer. We extract wave components from the sequence of inferred geometry through a network equipped with a differentiable ocean model. Then, ocean dynamics can be evolved using the reconstructed wave components. Through extensive experiments, we verify that our new method yields satisfactory results for both geometry reconstruction and wave estimation. Moreover, the new framework has the inverse modeling potential to facilitate a host of graphics applications, such as the rapid production of physically accurate scene animation and editing guided by real ocean scenes.
Funding: sponsored by grants from the National Natural Science Foundation of China (62002010, 61872347), the CAMS Innovation Fund for Medical Sciences (2019-I2M5-016), and the Special Plan for the Development of Distinguished Young Scientists of ISCAS (Y8RC535018).
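To make the wave-component representation concrete, here is a minimal differentiable wave sum in PyTorch: the surface height is a sum of directional sinusoids whose amplitudes, wave vectors, and phases are tensors that gradient descent can fit to the inferred geometry. A linear Airy-style sum is an assumption; the paper's differentiable ocean model may be richer.

```python
import torch

def ocean_height(xy, t, amps, ks, phases, omegas):
    # xy: (P, 2) sample points; ks: (W, 2) wave vectors;
    # amps/phases/omegas: (W,) per-component parameters. Because everything
    # is a tensor, the parameters are differentiable and can be optimized.
    phase = xy @ ks.T - omegas * t + phases        # (P, W) per-component phases
    return (amps * torch.cos(phase)).sum(dim=-1)   # (P,) surface heights
```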
Hierarchical multi-granularity image classification is a challenging task that aims to tag each given image with multiple granularity labels simultaneously. Existing methods tend to overlook that different image regions contribute differently to label prediction at different granularities, and also insufficiently consider relationships between the hierarchical multi-granularity labels. We introduce a sequence-to-sequence mechanism to overcome these two problems and propose a multi-granularity sequence generation (MGSG) approach for the hierarchical multi-granularity image classification task. Specifically, we introduce a transformer architecture to encode the image into visual representation sequences. Next, we traverse the taxonomic tree and organize the multi-granularity labels into sequences, and vectorize them and add positional information. The proposed multi-granularity sequence generation method builds a decoder that takes visual representation sequences and semantic label embeddings as inputs, and outputs the predicted multi-granularity label sequence. The decoder models dependencies and correlations between multi-granularity labels through a masked multi-head self-attention mechanism, and relates visual information to the semantic label information through a cross-modality attention mechanism. In this way, the proposed method preserves the relationships between labels at different granularity levels and takes into account the influence of different image regions on labels with different granularities. Evaluations on six public benchmarks qualitatively and quantitatively demonstrate the advantages of the proposed method. Our project is available at https://github.com/liuxindazz/mgs.
Funding: supported by the National Key R&D Program of China (2019YFC1521102), the National Natural Science Foundation of China (61932003), and the Beijing Science and Technology Plan (Z221100007722004).
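A hedged sketch of the decoder pattern described above, using standard PyTorch modules: masked self-attention over previously emitted granularity labels plus cross-attention to the visual token sequence. Layer sizes, the label vocabulary, and the omitted positional encodings are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LabelSequenceDecoder(nn.Module):
    def __init__(self, n_labels, d_model=256, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(n_labels, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_labels)

    def forward(self, label_ids, visual_tokens):
        # label_ids: (B, T) coarse-to-fine labels; visual_tokens: (B, N, d_model).
        T = label_ids.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"),
                                       device=label_ids.device), diagonal=1)
        h = self.decoder(self.embed(label_ids), visual_tokens, tgt_mask=causal)
        return self.head(h)  # logits for the next-granularity label
```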
Recently, facial-expression recognition (FER) has primarily focused on images in the wild, including factors such as face occlusion and image blurring, rather than laboratory images. Complex field environments have introduced new challenges to FER. To address these challenges, this study proposes a cross-fusion dual-attention network. The network comprises three parts: (1) a cross-fusion grouped dual-attention mechanism to refine local features and obtain global information; (2) a proposed C2 activation function construction method, which is a piecewise cubic polynomial with three degrees of freedom, requiring less computation with improved flexibility and recognition abilities, which can better address slow running speeds and neuron inactivation problems; and (3) a closed-loop operation between the self-attention distillation process and residual connections to suppress redundant information and improve the generalization ability of the model. The recognition accuracies on the RAF-DB, FERPlus, and AffectNet datasets were 92.78%, 92.02%, and 63.58%, respectively. Experiments show that this model can provide more effective solutions for FER tasks.
Funding: supported in part by the National Natural Science Foundation of China under Grant Nos. 62272281 and 62007017, the Special Funds for the Taishan Scholars Project under Grant No. tsqn202306274, and the Youth Innovation Technology Project of the Higher School in Shandong Province under Grant No. 2019KJN042.
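The abstract does not spell out the C2 construction, but a two-piece cubic can be made C2 at the joint by sharing its lower-order coefficients, as in this hedged sketch: the value and the first two derivatives automatically agree at x = 0 because only the cubic coefficient switches, leaving three learnable degrees of freedom. This illustrates the continuity argument, not the paper's exact polynomial.

```python
import torch
import torch.nn as nn

class PiecewiseCubicC2(nn.Module):
    # f(x) = a*x^3 + b*x^2 + x, with a different cubic coefficient on each
    # side of 0. The shared quadratic, linear, and (zero) constant terms make
    # f, f', and f'' continuous at the joint, so f is C2 by construction.
    def __init__(self):
        super().__init__()
        self.a_neg = nn.Parameter(torch.tensor(0.0))  # cubic coeff, x < 0
        self.a_pos = nn.Parameter(torch.tensor(0.1))  # cubic coeff, x >= 0
        self.b = nn.Parameter(torch.tensor(0.0))      # shared quadratic coeff

    def forward(self, x):
        a = torch.where(x < 0, self.a_neg, self.a_pos)
        return a * x ** 3 + self.b * x ** 2 + x
```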
The cross-view matching of local image features is a fundamental task in visual localization and 3D reconstruction. This study proposes FilterGNN, a transformer-based graph neural network (GNN), aiming to improve the matching efficiency and accuracy of visual descriptors. Based on high matching sparseness and coarse-to-fine covisible area detection, FilterGNN utilizes cascaded optimal graph-matching filter modules to dynamically reject outlier matches. Moreover, we successfully adapted linear attention in FilterGNN with post-instance normalization support, which significantly reduces the complexity of complete graph learning from O(N²) to O(N). Experiments show that FilterGNN requires only 6% of the time cost and 33.3% of the memory cost compared with SuperGlue under a large-scale input size, and achieves competitive performance in various tasks, such as pose estimation, visual localization, and sparse 3D reconstruction.
Funding: supported by the National Natural Science Foundation of China (Grant No. 62220106003) and the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology.
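The O(N²) to O(N) step comes from kernelized (linear) attention, sketched below in the style of Katharopoulos et al.: replacing softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV) lets the small (d × d) summary φ(K)ᵀV be built once and reused for every query. Whether this matches FilterGNN's exact formulation, including its post-instance normalization, is an assumption.

```python
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (N, d). Cost is O(N d^2) instead of O(N^2 d).
    phi = lambda t: F.elu(t) + 1                  # positive feature map
    q, k = phi(q), phi(k)
    kv = k.transpose(-2, -1) @ v                  # (d, d) summary of keys/values
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps  # normalizer
    return (q @ kv) / z
```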
Despite significant developments in 3D multi-view multi-person (3D MM) tracking, current frameworks separately target footprint tracking or pose tracking. Frameworks designed for the former cannot be used for the latter, because they directly obtain 3D positions on the ground plane via a homography projection, which is inapplicable to 3D poses above the ground. In contrast, frameworks designed for pose tracking generally isolate multi-view and multi-frame associations and may not be sufficiently robust for footprint tracking, which utilizes fewer key points than pose tracking, weakening multi-view association cues in a single frame. This study presents a unified multi-view multi-person tracking framework to bridge the gap between footprint tracking and pose tracking. Without additional modifications, the framework can adopt monocular 2D bounding boxes and 2D poses as its input to produce robust 3D trajectories for multiple persons. Importantly, multi-frame and multi-view information are jointly employed to improve association and triangulation. Our framework is shown to provide state-of-the-art performance on the Campus and Shelf datasets for 3D pose tracking, with comparable results on the WILDTRACK and MMPTRACK datasets for 3D footprint tracking.
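The triangulation component can be illustrated with the standard linear DLT solve: each calibrated view contributes two equations, and the 3D point is the null-space direction of the stacked system. This shows the textbook operation the framework builds on, not its joint multi-frame formulation.

```python
import numpy as np

def triangulate(projs, points2d):
    # projs: list of 3x4 camera projection matrices;
    # points2d: matching list of (x, y) pixel observations.
    rows = []
    for P, (x, y) in zip(projs, points2d):
        rows.append(x * P[2] - P[0])   # two linear constraints per view
        rows.append(y * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    X = Vt[-1]                          # homogeneous least-squares solution
    return X[:3] / X[3]
```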
Estimating 3D hand shape from a single-view RGB image is important for many applications. However, the diversity of hand shapes and postures, depth ambiguity, and occlusion may result in pose errors and noisy hand meshes. Making full use of 2D cues such as 2D pose can effectively improve the quality of 3D human hand shape estimation. In this paper, we use 2D joint heatmaps to obtain spatial details for robust pose estimation. We also introduce a depth-independent 2D mesh to avoid depth ambiguity in mesh regression for efficient hand-image alignment. Our method has four cascaded stages: 2D cue extraction, pose feature encoding, initial reconstruction, and reconstruction refinement. Specifically, we first encode the image to determine semantic features during 2D cue extraction; this is also used to predict hand joints and for segmentation. Then, during the pose feature encoding stage, we use a hand joints encoder to learn spatial information from the joint heatmaps. Next, a coarse 3D hand mesh and 2D mesh are obtained in the initial reconstruction step; a mesh squeeze-and-excitation block is used to fuse different hand features to enhance perception of 3D hand structures. Finally, a global mesh refinement stage learns non-local relations between vertices of the hand mesh from the predicted 2D mesh, to predict an offset hand mesh to fine-tune the reconstruction results. Quantitative and qualitative results on the FreiHAND benchmark dataset demonstrate that our approach achieves state-of-the-art performance.
Acknowledgements: We would like to thank the reviewers for valuable comments. This work was supported by grants from the National Natural Science Foundation of China (61976227, 62176096) and the Natural Science Foundation of Hubei Province (2020CFA025).
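The mesh squeeze-and-excitation block presumably follows the standard SE pattern, sketched here over per-vertex features: pool across vertices, squeeze through a bottleneck, and re-weight channels. The sizes and the exact fusion wiring in the paper are assumptions.

```python
import torch.nn as nn

class MeshSEBlock(nn.Module):
    # x: (B, V, C) per-vertex mesh features.
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.gate(x.mean(dim=1))   # squeeze over the vertex dimension
        return x * w.unsqueeze(1)      # excite: channel-wise re-weighting
```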
Existing unsupervised person re-identification approaches fail to fully capture the fine-grained features of local regions, which can result in people with similar appearances and different identities being assigned the same label after clustering. The identity-independent information contained in different local regions leads to different levels of local noise. To address these challenges, joint training with local soft attention and dual cross-neighbor label smoothing (DCLS) is proposed in this study. First, the joint training is divided into global and local parts, whereby a soft attention mechanism is proposed for the local branch to accurately capture the subtle differences in local regions, which improves the ability of the re-identification model in identifying a person's local significant features. Second, DCLS is designed to progressively mitigate label noise in different local regions. The DCLS uses global and local similarity metrics to semantically align the global and local regions of the person and further determines the proximity association between local regions through the cross information of neighboring regions, thereby achieving label smoothing of the global and local regions throughout the training process. In extensive experiments, the proposed method outperformed existing methods under unsupervised settings on several standard person re-identification datasets.
Funding: supported by the National Natural Science Foundation of China under Grant Nos. 62076117 and 62166026, the Jiangxi Key Laboratory of Smart City under Grant No. 20192BCD40002, and the Jiangxi Provincial Natural Science Foundation under Grant No. 20224BAB212011.
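A hedged sketch of the smoothing idea: instead of trusting each one-hot cluster pseudo-label, reserve part of the mass for the label histogram of a sample's nearest neighbors in feature space. DCLS additionally crosses global and local neighbor information; this shows a single branch under that assumption.

```python
import numpy as np

def neighbor_smoothed_labels(feats, pseudo, n_classes, k=10, alpha=0.2):
    # feats: (N, D) L2-normalized features; pseudo: (N,) integer cluster labels.
    sim = feats @ feats.T                            # cosine similarities
    nbrs = np.argsort(-sim, axis=1)[:, 1:k + 1]      # k nearest, excluding self
    smoothed = np.zeros((len(feats), n_classes))
    smoothed[np.arange(len(feats)), pseudo] = 1.0 - alpha
    for i, nb in enumerate(nbrs):
        smoothed[i] += alpha * np.bincount(pseudo[nb], minlength=n_classes) / k
    return smoothed
```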
Free-viewpoint video allows the user to view objects from any virtual perspective, creating an immersive visual experience. This technology enhances the interactivity and freedom of multimedia performances. However, many free-viewpoint video synthesis methods hardly satisfy the requirement to work in real time with high precision, particularly for sports fields having large areas and numerous moving objects. To address these issues, we propose a free-viewpoint video synthesis method based on distance field acceleration. The central idea is to fuse multi-view distance field information and use it to adjust the search step size adaptively. Adaptive step size search is used in two ways: for fast estimation of multi-object three-dimensional surfaces, and for synthetic view rendering based on global occlusion judgement. We have implemented our ideas using parallel computing for interactive display, using the CUDA and OpenGL frameworks, and have used real-world and simulated experimental datasets for evaluation. The results show that the proposed method can render free-viewpoint videos with multiple objects on large sports fields at 25 fps. Furthermore, the visual quality of our synthetic novel viewpoint images exceeds that of state-of-the-art neural-rendering-based methods.
Funding: supported by the National Natural Science Foundation of China (Nos. 62172315, 62073262, and 61672429), the Fundamental Research Funds for the Central Universities, the Innovation Fund of Xidian University (No. 20109205456), the Key Research and Development Program of Shaanxi (No. S2021-YF-ZDCXL-ZDLGY-0127), and HUAWEI.
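The adaptive step size search is in the spirit of sphere tracing, sketched below: a distance field reports how far the nearest surface can possibly be, so a ray can safely jump by that amount instead of stepping uniformly. `dist_field` is a hypothetical callable standing in for the fused multi-view distance field.

```python
def march_to_surface(origin, direction, dist_field, t_max=50.0, eps=1e-3):
    # origin, direction: 3-vectors (numpy arrays); direction assumed unit length.
    t = 0.0
    while t < t_max:
        p = origin + t * direction
        d = dist_field(p)          # conservative distance to nearest surface
        if d < eps:
            return p               # close enough: report the hit point
        t += d                     # adaptive step: large in empty space
    return None                    # no surface hit within range
```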
Computer-generated aesthetic patterns are widely used as design materials in various fields. The most common methods use fractals or dynamical systems as basic tools to create various patterns. To enhance aesthetics and controllability, some researchers have introduced symmetric layouts along with these tools. One popular strategy employs dynamical systems compatible with symmetries that construct functions with the desired symmetries. However, these are typically confined to simple planar symmetries. The other generates symmetrical patterns under the constraints of tilings. Although it is slightly more flexible, it is restricted to small ranges of tilings and lacks textural variations. Thus, we proposed a new approach for generating aesthetic patterns by symmetrizing quasi-regular patterns using general k-uniform tilings. We adopted a unified strategy to construct invariant mappings for k-uniform tilings that can eliminate texture seams across the tiling edges. Furthermore, we constructed three types of symmetries associated with the patterns: dihedral, rotational, and reflection symmetries. The proposed method can be easily implemented using GPU shaders and is highly efficient and suitable for complicated tilings with regular polygons. Experiments demonstrated the advantages of our method over state-of-the-art methods in terms of flexibility in controlling the generation of patterns with various parameters as well as the diversity of textures and styles.
Funding: supported by the Key R&D Programs of Zhejiang Province (Nos. 2023C01224 and 2022C01220) and the National Natural Science Foundation of China (No. 61702458). Yun Zhang was partially supported by Zhejiang Province Public Welfare Technology Application Research (No. LGG22F020009) and the Key Lab of Film and TV Media Technology of Zhejiang Province (No. 2020E10015).
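The invariant-mapping principle can be shown on the simplest symmetry group: averaging any pattern function over the n rotations of a cyclic group yields a function with n-fold rotational symmetry. The paper constructs analogous invariants per k-uniform tiling; this sketch only illustrates the symmetrization step.

```python
import numpy as np

def symmetrize(f, pts, n=6):
    # f: pattern function mapping (N, 2) points to (N,) values; pts: (N, 2).
    total = np.zeros(len(pts))
    for k in range(n):
        a = 2 * np.pi * k / n
        R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
        total += f(pts @ R.T)      # evaluate f on each rotated copy
    return total / n               # invariant under n-fold rotation
```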
Inspired by the success of WaveNet in multi-subject speech synthesis, we propose a novel neural network based on causal convolutions for multi-subject motion modeling and generation. The network can capture the intrinsic characteristics of the motion of different subjects, such as the influence of skeleton scale variation on motion style. Moreover, after fine-tuning the network using a small motion dataset for a novel skeleton that is not included in the training dataset, it is able to synthesize high-quality motions with a personalized style for the novel skeleton. The experimental results demonstrate that our network can model the intrinsic characteristics of motions well and can be applied to various motion modeling and synthesis tasks.
Acknowledgements: We thank the anonymous reviewers for their constructive comments. Weiwei Xu is partially supported by the National Natural Science Foundation of China (No. 61732016).
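A minimal sketch of the WaveNet-style ingredient named above: a dilated 1D convolution made causal by left-padding, so each output frame of a motion sequence depends only on past frames. The channel count, kernel size, and dilations are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvBlock(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.pad = 2 * dilation                        # (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, 3, dilation=dilation)

    def forward(self, x):                              # x: (B, C, T)
        y = F.pad(x, (self.pad, 0))                    # pad on the left only
        return torch.relu(self.conv(y)) + x            # residual connection

# A stack with growing dilations sees an exponentially growing past window.
stack = nn.Sequential(*[CausalConvBlock(64, 2 ** i) for i in range(4)])
```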
The visual modeling method enables flexible interactions with rich graphical depictions of data and supports the exploration of the complexities of epidemiological analysis. However, most epidemiology visualizations do not support the combined analysis of objective factors that might influence the transmission situation, resulting in a lack of quantitative and qualitative evidence. To address this issue, we developed a portrait-based visual modeling method called +msRNAer. This method considers the spatiotemporal features of virus transmission patterns and multidimensional features of objective risk factors in communities, enabling portrait-based exploration and comparison in epidemiological analysis. We applied +msRNAer to aggregate COVID-19-related datasets in New South Wales, Australia, combining COVID-19 case number trends, geo-information, intervention events, and expert-supervised risk factors extracted from local government area-based censuses. We perfected the +msRNAer workflow with collaborative views and evaluated its feasibility, effectiveness, and usefulness through one user study and three subject-driven case studies. Positive feedback from experts indicates that +msRNAer provides a general understanding for epidemiological analysis: it not only compares relationships between cases in time-varying trends and risk factors through portraits, but also supports navigation in fundamental geographical, timeline, and other factor comparisons. By adopting the interactions, experts discovered functional and practical implications of potential patterns of long-standing community factors regarding the vulnerability faced during the pandemic. Experts confirmed that +msRNAer is expected to deliver visual modeling benefits with spatiotemporal and multidimensional features in other epidemiological analysis scenarios.
Funding: supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61972010, and by a UTS–CSC Scholarship from the University of Technology Sydney and the China Scholarship Council under Agreement No. 201908200009.
It remains an interesting and challenging problem to synthesize a vivid and realistic singing face driven by music. In this paper, we present a method for this task with natural motions for the lips, facial expression, head pose, and eyes. Due to the coupling of mixed information for the human voice and backing music in common music audio signals, we design a decouple-and-fuse strategy to tackle the challenge. We first decompose the input music audio into a human voice stream and a backing music stream. Due to the implicit and complicated correlation between the two-stream input signals and the dynamics of the facial expressions, head motions, and eye states, we model their relationship with an attention scheme, where the effects of the two streams are fused seamlessly. Furthermore, to improve the expressiveness of the generated results, we decompose head movement generation in terms of speed and direction, and decompose eye state generation into short-term blinking and long-term eye closing, modeling them separately. We have also built a novel dataset, SingingFace, to support training and evaluation of models for this task, including future work on this topic. Extensive experiments and a user study show that our proposed method is capable of synthesizing vivid singing faces, qualitatively and quantitatively better than the prior state of the art.
Funding: supported in part by grants from the National Key R&D Program of China (2021YFC3300403), the National Natural Science Foundation of China (62072382), the Yango Charitable Foundation, and the National Science Foundation (OAC-2007661).
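One way to realize the fusion step, sketched with a standard attention layer: motion-related queries attend jointly over tokens from the separated voice and backing-music streams, so the output mixes both sources. The dimensions, token counts, and single-layer design are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

d = 128
fuse = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
motion_q = torch.randn(1, 10, d)                      # queries: motion tokens
streams = torch.cat([torch.randn(1, 50, d),           # voice-stream tokens
                     torch.randn(1, 50, d)], dim=1)   # backing-music tokens
fused, weights = fuse(motion_q, streams, streams)     # (1, 10, d) fused output
```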
Template matching is a fundamental task in computer vision and has been studied for decades. It plays an essential role in the manufacturing industry for estimating the poses of different parts, facilitating downstream tasks such as robotic grasping. Existing methods fail when the template and source images have different modalities, cluttered backgrounds, or weak textures. They also rarely consider geometric transformations via homographies, which commonly exist even for planar industrial parts. To tackle these challenges, we propose an accurate template matching method based on differentiable coarse-to-fine correspondence refinement. We use an edge-aware module to overcome the domain gap between the mask template and the grayscale image, allowing robust matching. An initial warp is estimated using coarse correspondences based on novel structure-aware information provided by transformers. This initial alignment is passed to a refinement network using references and aligned images to obtain sub-pixel level correspondences, which are used to give the final geometric transformation. Extensive evaluation shows our method to be significantly better than state-of-the-art methods and baselines, providing good generalization ability and visually plausible results even on unseen real data.
Funding: supported in part by the National Key R&D Program of China (2018AAA0102200) and the National Natural Science Foundation of China (62002375, 62002376, 62325221, 62132021).
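The final geometric transformation can be illustrated with a robust homography fit on the refined correspondences; below, the correspondences are synthesized from a known warp as stand-ins, and OpenCV's RANSAC estimator does the fitting. This shows the standard last step, not the paper's learned refinement.

```python
import cv2
import numpy as np

H_true = np.array([[1.02, 0.01, 5.0],                 # stand-in ground-truth warp
                   [0.00, 0.98, -3.0],
                   [1e-5, 0.00, 1.0]])
template_pts = (np.random.rand(100, 1, 2) * 256).astype(np.float32)
source_pts = cv2.perspectiveTransform(template_pts, H_true)
source_pts += np.random.randn(*source_pts.shape).astype(np.float32) * 0.5
H, inliers = cv2.findHomography(template_pts, source_pts, cv2.RANSAC, 3.0)
corners = np.float32([[0, 0], [256, 0], [256, 256], [0, 256]]).reshape(-1, 1, 2)
outline = cv2.perspectiveTransform(corners, H)        # template outline in source
```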
We propose a unified 3D flow framework for joint learning of shape embedding and deformation for different categories. Our goal is to recover shapes from imperfect point clouds by fitting the best shape template in a shape repository after deformation. Accordingly, we learn a shape embedding for template retrieval and a flow-based network for robust deformation. We note that the deformation flow can be quite different for different shape categories. Therefore, we introduce a novel multi-hub module to learn multiple modes of deformation to incorporate such variation, providing a network which can handle a wide range of objects from different categories. The shape embedding is designed to retrieve the best-fit template as the nearest neighbor in a latent space. We replace the standard fully connected layer with a tiny structure in the embedding that significantly reduces network complexity and further improves deformation quality. Experiments show the superiority of our method to existing state-of-the-art methods via qualitative and quantitative comparisons. Finally, our method provides efficient and flexible deformation that can further be used for novel shape design.
Funding: supported by the National Key R&D Program of China (2020YFB1708900) and the National Natural Science Foundation of China (62072271).
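The retrieval step reduces to a nearest-neighbor lookup in the learned latent space, as in this hedged sketch; the flow-based network then deforms the retrieved template. The names and shapes are illustrative.

```python
import torch

def retrieve_template(query_embedding, template_embeddings):
    # query_embedding: (D,) embedding of the imperfect point cloud;
    # template_embeddings: (K, D) embeddings of the shape repository.
    d = torch.norm(template_embeddings - query_embedding, dim=1)
    return int(d.argmin())          # index of the best-fit template
```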