Abstract
Existing voice-face cross-modal association learning methods still face challenges in semantic correlation and supervision, and have not yet fully considered the semantic interaction between voice and face. To address these problems, a self-supervised association learning method based on a multi-modal shared network is proposed. First, voice and face features are mapped onto the unit sphere to establish a common feature space. Second, residual blocks in the multi-modal shared network are used to mine complex nonlinear relationships in the data, while a weight-sharing fully connected layer strengthens the correlation between voice and face feature vectors. Finally, pseudo-labels generated by the K-means clustering algorithm serve as supervision signals to guide metric learning, completing four cross-modal association learning tasks. Experimental results show that the proposed method performs well on voice-face cross-modal verification, matching, and retrieval tasks, improving accuracy on several evaluation metrics by 1% to 4% over existing baseline methods.
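The pipeline the abstract outlines — L2-normalizing both modalities onto the unit sphere, projecting them through a weight-shared layer, and clustering the joint embeddings with K-means to obtain pseudo-labels — can be sketched as follows. This is a minimal NumPy illustration of the general idea, not the paper's implementation: the dimensions, the random embeddings, and the toy `kmeans_labels` helper are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_sphere(x):
    # L2-normalize each row so every embedding lies on the unit hypersphere
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Hypothetical 128-d voice and face embeddings for 16 samples
voice = unit_sphere(rng.normal(size=(16, 128)))
face = unit_sphere(rng.normal(size=(16, 128)))

# Weight sharing: the SAME projection matrix maps both modalities,
# pushing them toward a common feature space (stand-in for the shared FC layer)
W = rng.normal(size=(128, 64)) / np.sqrt(128)
v_proj = unit_sphere(voice @ W)
f_proj = unit_sphere(face @ W)

def kmeans_labels(x, k=4, iters=20, seed=0):
    # Toy K-means: returns a cluster index per row, used as a pseudo-label
    r = np.random.default_rng(seed)
    centers = x[r.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(axis=0)
    return labels

# Pseudo-labels over the joint voice+face embeddings supervise metric learning
pseudo = kmeans_labels(np.vstack([v_proj, f_proj]))
```

In practice the shared layer would be trained (e.g. with a metric-learning loss against the pseudo-labels) rather than fixed, and a production system would use a library K-means; the sketch only shows how the three components connect.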
Authors
LI Jun-yu; BU Fan-liang; TAN Lin; ZHOU Yu-chen; MAO Jing-yi
(School of Information Network Security, People's Public Security University of China, Beijing 100038, China; First Research Institute of the Ministry of Public Security of PRC, Beijing 100048, China)
Source
《科学技术与工程》(Science Technology and Engineering), a Peking University core journal
2024, Issue 7, pp. 2804-2812 (9 pages in total)
Funding
Double First-Class Special Project in Security Prevention Engineering, People's Public Security University of China (2023SYL08).
Keywords
voice-face cross-modal
multi-modal shared network
pseudo label
association learning