Funding: Supported in part by the National Natural Science Foundation of China (62176059, 62101136).
Abstract: Binaural rendering is of great interest to virtual reality and immersive media. Although humans naturally use their two ears to perceive the spatial information contained in sounds, binaural rendering is a challenging task for machines, since describing a sound field often requires multiple channels and even metadata about the sound sources. In addition, the perceived sound varies from person to person, even in the same sound field. Previous methods generally rely on individual-dependent head-related transfer function (HRTF) datasets and optimization algorithms that operate on HRTFs. In practical applications, existing methods have two major drawbacks. The first is the high cost of personalization, as traditional methods meet personalized needs by measuring HRTFs. The second is insufficient accuracy, because traditional methods optimize by discarding one part of the information in order to retain another part that matters more perceptually. It is therefore desirable to develop novel techniques that achieve personalization and accuracy at low cost. To this end, we focus on the binaural rendering of ambisonics and propose 1) a channel-shared encoder and channel-compared attention integrated into neural networks, and 2) a loss function quantifying interaural level differences to handle spatial information. To verify the proposed method, we collect and release the first paired ambisonic-binaural dataset and introduce three metrics to evaluate the content-information and spatial-information accuracy of end-to-end methods. Extensive experimental results on the collected dataset demonstrate the superior performance of the proposed method and reveal the shortcomings of previous methods.
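The abstract's central loss idea, penalizing errors in interaural level difference (ILD), can be illustrated with a minimal numpy sketch. This is not the paper's actual implementation; the function names, the frame length, and the frame-wise mean-squared formulation are illustrative assumptions.

```python
import numpy as np

def ild_db(left, right, eps=1e-8):
    # Interaural level difference in dB: ratio of left/right signal power.
    pl = np.mean(left ** 2) + eps
    pr = np.mean(right ** 2) + eps
    return 10.0 * np.log10(pl / pr)

def ild_loss(pred, target, frame=1024):
    # Mean squared ILD error over non-overlapping frames.
    # pred and target are binaural signals of shape (2, T).
    T = pred.shape[1]
    errs = []
    for s in range(0, T - frame + 1, frame):
        e_pred = ild_db(pred[0, s:s + frame], pred[1, s:s + frame])
        e_tgt = ild_db(target[0, s:s + frame], target[1, s:s + frame])
        errs.append((e_pred - e_tgt) ** 2)
    return float(np.mean(errs))
```

A loss of this shape is differentiable almost everywhere, so it could in principle be attached to a network's binaural output alongside a standard waveform or spectral loss.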
Funding: Supported by the National Natural Science Foundation of China (Grant No. 51935003).
Abstract: In recent years, addressing ill-posed problems by leveraging prior knowledge contained in databases through learning techniques has gained much attention. In this paper, we focus on complete three-dimensional (3D) point cloud reconstruction from a single red-green-blue (RGB) image, a task that cannot be approached using classical reconstruction techniques. For this purpose, we use an encoder-decoder framework to encode the RGB information in a latent space and to predict the 3D structure of the considered object from different viewpoints. The individual predictions are combined into a common representation that is used in a module combining camera pose estimation and rendering, thereby achieving differentiability with respect to the imaging process and the camera pose, and enabling optimization of the two-dimensional prediction error of novel viewpoints. Thus, our method allows end-to-end training and does not require supervision based on additional ground-truth (GT) mask annotations or ground-truth camera pose annotations. Our evaluation on synthetic and real-world data demonstrates the robustness of our approach to appearance changes and self-occlusions, outperforming current state-of-the-art methods in terms of accuracy, density, and model completeness.
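The "two-dimensional prediction error" the abstract optimizes rests on projecting predicted 3D points into a camera view. The sketch below shows only that geometric core with a pinhole camera model; it is a simplified stand-in, not the paper's rendering module, and all names (`project_points`, `reprojection_error`) are hypothetical.

```python
import numpy as np

def project_points(points, K, R, t):
    # Project Nx3 world points into pixel coordinates using
    # intrinsics K (3x3) and camera pose (rotation R, translation t).
    cam = points @ R.T + t            # world frame -> camera frame
    uvw = cam @ K.T                   # apply pinhole intrinsics
    return uvw[:, :2] / uvw[:, 2:3]   # perspective divide

def reprojection_error(points, K, R, t, observed):
    # Mean 2D distance between projected and observed pixel positions.
    proj = project_points(points, K, R, t)
    return float(np.mean(np.linalg.norm(proj - observed, axis=1)))
```

Because every operation here is differentiable (away from the degenerate depth-zero case), gradients of the 2D error can flow back to both the predicted points and the camera pose, which is the property the abstract relies on for end-to-end training.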
Abstract: To address the poor reconstruction quality of multi-view stereo (MVS) networks in challenging regions such as weakly textured or non-Lambertian surfaces, we first propose a multi-scale feature extraction module based on three parallel dilated convolutions and an attention mechanism, which enlarges the receptive field while capturing dependencies among features to obtain global context, thereby strengthening the network's feature representation in challenging regions for robust feature matching. Second, an attention mechanism is introduced into the 3D CNN used for cost-volume regularization, so that the network focuses on important regions of the cost volume during smoothing. In addition, a neural rendering network is built that uses a rendering reference loss to accurately extract the geometric appearance information expressed by the radiance field, and a depth-consistency loss is introduced to maintain geometric consistency between the MVS network and the neural rendering network, effectively mitigating the adverse effect of a noisy cost volume on the MVS network. On the indoor DTU dataset, the completeness and overall metrics of the reconstructed point clouds are 0.289 and 0.326, improvements of 24.9% and 8.2% over the baseline CasMVSNet, with high-quality reconstructions even in challenging regions; on the outdoor Tanks and Temples intermediate dataset, the mean F-score of point cloud reconstruction is 60.31, a 9.9% improvement over UCS-Net, demonstrating strong generalization ability.
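The multi-scale module described above hinges on dilated convolutions: spreading a small kernel's taps apart enlarges the receptive field without adding parameters. The following single-channel numpy sketch shows that mechanism and the three-branch parallel layout; it omits the attention mechanism and learned weights, and the function names and dilation rates (1, 2, 3) are illustrative assumptions.

```python
import numpy as np

def dilated_conv2d(x, k, dilation):
    # 'Same'-size filtering of a single-channel map x with a 3x3 kernel k
    # whose taps are spaced `dilation` pixels apart (zero-padded borders).
    pad = dilation
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            # Tap (i, j) reads from an offset shifted by i*dilation, j*dilation.
            out += k[i, j] * xp[i * dilation:i * dilation + x.shape[0],
                                j * dilation:j * dilation + x.shape[1]]
    return out

def multiscale_features(x, kernels, dilations=(1, 2, 3)):
    # Three parallel branches with increasing dilation, stacked as channels.
    return np.stack([dilated_conv2d(x, k, d)
                     for k, d in zip(kernels, dilations)])
```

In the actual network each branch would be a learned multi-channel convolution and the stacked responses would be fused (e.g. by the attention mechanism the abstract mentions) before feature matching.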