Abstract
To address the limited reliability of speaker features extracted by existing speaker recognition models, and the weak correlation among features of different scales during fusion, a speaker recognition algorithm based on a hierarchical attention feature fusion network (HAFF-Net) is studied. Pre-processed speech features are first down-sampled with convolution and pooling operations to reduce their dimensionality. The extracted features are then fed into a hierarchical attention feature fusion block (HAFFB), in which mean coordinate attention (MCA) enhances the reliability of the speaker features and an attention feature fusion (AFF) module captures complementary multi-scale features. Statistical pooling and a fully connected layer extract the speaker embedding, and the model is optimized end-to-end with the additive angular margin loss (AAM-Softmax). The results show that the proposed algorithm effectively enhances the reliability of the feature representation, captures the differences among multi-scale features, and improves speaker recognition performance.
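The final stage named in the abstract (statistical pooling followed by an AAM-Softmax objective) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the margin of 0.2, the scale of 30, and the tensor shapes are all assumptions chosen for illustration, and the HAFFB, MCA, and AFF modules are not reproduced because the abstract does not specify them.

```python
import numpy as np

def statistics_pooling(frames):
    """Collapse frame-level features (T, D) into one utterance vector (2*D,)
    by concatenating the per-dimension mean and standard deviation."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def aam_softmax_loss(embeddings, weights, labels, margin=0.2, scale=30.0):
    """Additive angular margin softmax loss (illustrative sketch).

    embeddings: (N, E) speaker embeddings (assumed non-zero)
    weights:    (C, E) one class-centre vector per speaker
    labels:     (N,)   integer speaker ids
    margin and scale defaults are assumed values, not taken from the paper.
    """
    # L2-normalise both sides so the dot product is the cosine of the angle.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(emb @ w.T, -1.0, 1.0)            # (N, C)

    # Add the margin to the angle of the target class only.
    rows = np.arange(len(labels))
    theta_target = np.arccos(cos[rows, labels])
    logits = cos.copy()
    logits[rows, labels] = np.cos(theta_target + margin)
    logits *= scale

    # Numerically stable cross-entropy over the scaled logits.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()

# Hypothetical usage with random data standing in for real features.
rng = np.random.default_rng(0)
utterances = [rng.normal(size=(200, 64)) for _ in range(4)]   # 4 utterances
pooled = np.stack([statistics_pooling(u) for u in utterances])  # (4, 128)
fc = rng.normal(size=(128, 32)) * 0.1       # stand-in fully connected layer
embeddings = pooled @ fc                    # (4, 32) speaker embeddings
centres = rng.normal(size=(3, 32))          # 3 hypothetical speakers
loss = aam_softmax_loss(embeddings, centres, np.array([0, 1, 2, 1]))
```

The sketch omits the usual safeguard for angles where adding the margin would exceed π, which practical implementations typically handle with a fallback term.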
Authors
赵宏
高楠
王伟杰
杨昌东
ZHAO Hong; GAO Nan; WANG Wei-jie; YANG Chang-dong (School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China; Information Technology Management Department, Postal Savings Bank of China Gansu Branch, Lanzhou 730030, China)
Source
《计算机工程与设计》
Peking University Core Journal (北大核心)
2024, No. 11, pp. 3413-3419 (7 pages)
Computer Engineering and Design
Funding
National Natural Science Foundation of China (62166025)
Key Research and Development Program of Gansu Province (21YF5GA073)
Keywords
speaker recognition
hierarchical attention
mean coordinate attention
attention feature fusion
multi-scale features
additive angular margin loss
end-to-end