Journal Articles
6 articles found
1. Conversational Speech Emotion Recognition Based on Wav2vec2.0 and Contextual Emotional Information Compensation
Authors: 曹荣贺, 吴晓龙, 冯畅, 郑方, 徐明星, 哈妮克孜·伊拉洪, 艾斯卡尔·艾木都拉. 《信号处理》 (Journal of Signal Processing), CSCD / PKU Core, 2023, Issue 4, pp. 698-707 (10 pages)
Emotion plays an important role in human interaction. In everyday conversation, many utterances carry weak emotional coloring, complex emotion categories, and high ambiguity, which makes conversational speech emotion recognition a challenging task. To address this, much existing work retrieves emotional information from the entire dialogue and uses this global emotional information for prediction. However, when the emotions of successive utterances change sharply, indiscriminately introducing emotional information from earlier turns can interfere with the current prediction. This paper proposes a method based on Wav2vec2.0 and contextual emotional information compensation, which selects from the preceding context the emotional information most relevant to the current utterance as compensation. A contextual information compensation module first selects, from the dialogue history, the prosodic information of the utterance most likely to influence the emotion of the current utterance, and a long short-term memory (LSTM) network builds this prosodic information into a contextual emotional compensation representation. The pre-trained Wav2vec2.0 model then extracts an embedding of the current utterance, and this embedding is fused with the contextual representation for emotion recognition. The method achieves 69.0% weighted accuracy (WA) on the IEMOCAP dataset, significantly outperforming the baseline models.
Keywords: emotion recognition; dyadic dialogue; emotion compensation; wav2vec2.0
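The fusion described in this abstract pairs an utterance-level Wav2vec2.0 embedding with an LSTM-encoded contextual prosody vector. Below is a minimal sketch of that idea, not the authors' code; the checkpoint name, prosody feature size, and mean-pooling choice are all assumptions.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class ContextCompensatedSER(nn.Module):
    def __init__(self, n_emotions=4, prosody_dim=32, ctx_dim=128):
        super().__init__()
        # Pre-trained wav2vec 2.0 backbone (checkpoint name is an assumption).
        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        # LSTM turns the selected prior utterance's prosody sequence into a
        # fixed-size contextual emotional compensation vector.
        self.ctx_lstm = nn.LSTM(prosody_dim, ctx_dim, batch_first=True)
        hidden = self.wav2vec.config.hidden_size  # 768 for the base model
        self.classifier = nn.Linear(hidden + ctx_dim, n_emotions)

    def forward(self, waveform, ctx_prosody):
        # waveform: (batch, samples); ctx_prosody: (batch, frames, prosody_dim)
        emb = self.wav2vec(waveform).last_hidden_state.mean(dim=1)
        _, (h, _) = self.ctx_lstm(ctx_prosody)
        fused = torch.cat([emb, h[-1]], dim=-1)  # concatenate the two views
        return self.classifier(fused)

model = ContextCompensatedSER()
logits = model(torch.randn(2, 16000), torch.randn(2, 50, 32))
```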
2. A Wav2vec2.0 Neural Network Based Piezoelectric-Array Ultrasonic Guided Wave Method for Locating Rail Damage in Rail Transit (cited 1)
Authors: 刘思昊, 钱鲁斌, 梅曜华, 邢宇辉. 《城市轨道交通研究》 (Urban Mass Transit), PKU Core, 2023, Issue 6, pp. 101-105, 110 (6 pages)
Conventional ultrasonic testing cannot inspect rail-transit rails over long distances, and ultrasonic guided wave based SHM (structural health monitoring) struggles to extract damage features from response signals, which limits localization accuracy. To address this, a piezoelectric-array ultrasonic guided wave localization method based on a Wav2vec2.0 neural network is proposed for locating rail damage. The method is briefly introduced in light of the characteristics of piezoelectric-array guided wave data. An ultrasonic guided wave detection system for rail damage was built and used to collect a dataset; a three-dimensional finite element model of guided wave detection of rail damage was also built in ABAQUS and used to collect additional data. Wavelet signal processing was used to reconstruct the experimental guided wave signals for denoising; random noise was added to the simulated signals, and the noise-augmented simulated signals served as a supplementary dataset. Model performance was evaluated by computing the accuracy and error of rail damage localization. The results show that training accuracy reaches 100% by the 120th epoch, and that the proposed method can accurately locate rail damage in rail transit.
Keywords: rail transit; rail damage; piezoelectric-array ultrasonic guided wave localization; wav2vec2.0 neural network
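Two of the data-preparation steps mentioned above, wavelet reconstruction for denoising the experimental signals and random-noise augmentation of the simulated signals, can be sketched as follows. This is illustrative only; the wavelet family, decomposition level, threshold rule, and SNR are assumptions, not values from the paper.

```python
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet="db4", level=4):
    # Decompose, soft-threshold the detail coefficients, reconstruct.
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Noise level estimated from the finest detail band (universal threshold).
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]

def augment_with_noise(sim_signal, snr_db=20.0):
    # Add random noise to a simulated guided-wave signal at a target SNR to
    # build the supplementary dataset described in the abstract.
    power = np.mean(sim_signal ** 2)
    noise_power = power / (10 ** (snr_db / 10))
    return sim_signal + np.random.randn(len(sim_signal)) * np.sqrt(noise_power)
```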
3. Using Speaker-Specific Emotion Representations in Wav2vec 2.0-Based Modules for Speech Emotion Recognition
Authors: Somin Park, Mpabulungi Mark, Bogyung Park, Hyunki Hong. 《Computers, Materials & Continua》, SCIE / EI, 2023, Issue 10, pp. 1009-1030 (22 pages)
Speech emotion recognition is essential for frictionless human-machine interaction, where machines respond to human instructions with context-aware actions. The properties of individuals' voices vary with culture, language, gender, and personality. These variations in speaker-specific properties may hamper the performance of standard representations in downstream tasks such as speech emotion recognition (SER). This study demonstrates the significance of speaker-specific speech characteristics and how they can be leveraged to improve the performance of SER models. In the proposed approach, two wav2vec-based modules (a speaker-identification network and an emotion classification network) are trained with the ArcFace loss. The speaker-identification network has a single attention block to encode an input audio waveform into a speaker-specific representation. The emotion classification network uses a wav2vec 2.0 backbone as well as four attention blocks to encode the same input audio waveform into an emotion representation. These two representations are then fused into a single vector containing emotion and speaker-specific information. Experimental results showed that the use of speaker-specific characteristics improves SER performance. Additionally, combining these with an angular margin loss such as the ArcFace loss improves intra-class compactness while increasing inter-class separability, as demonstrated by plots of t-distributed stochastic neighbor embeddings (t-SNE). The proposed approach outperforms previous methods using similar training strategies, with a weighted accuracy (WA) of 72.14% and an unweighted accuracy (UA) of 72.97% on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. This demonstrates its effectiveness and potential to enhance human-machine interaction through more accurate emotion recognition in speech.
Keywords: attention block; IEMOCAP dataset; speaker-specific representation; speech emotion recognition; wav2vec 2.0
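A minimal sketch of the ArcFace (additive angular margin) head that the abstract says is used to train both the speaker-identification and emotion networks; the scale s=30 and margin m=0.5 are common defaults, not values confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    def __init__(self, emb_dim, n_classes, s=30.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class centers.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m only to the target-class angle, which
        # tightens intra-class compactness and widens inter-class gaps.
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        return F.cross_entropy(self.s * logits, labels)
```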
4. Query-by-Example Keyword Spotting Based on wav2vec Pre-training (cited 4)
Authors: 李昭奇, 黎塔. 《计算机科学》 (Computer Science), CSCD / PKU Core, 2022, Issue 1, pp. 59-64 (6 pages)
Query-by-example keyword spotting is the task of matching a spoken keyword segment against segments in a speech stream. In low-resource or zero-resource settings, it is usually approached with dynamic time warping. In recent years, neural acoustic word embeddings have become a common approach, but neural methods are limited by the amount of labeled data. wav2vec pre-training reduces the network's dependence on data volume and improves system performance. Directly replacing Mel-frequency cepstral coefficient (MFCC) features with pre-trained features extracted by a wav2vec model improved the average precision of a bidirectional LSTM acoustic word embedding system by 11.1% and its precision-recall break-even point by 10.0% on a dataset drawn from the Switchboard corpus. Fusing wav2vec features with MFCC features when extracting embedding vectors improved performance further: compared with using wav2vec alone, the fusion method raised average precision by 5.3% and the precision-recall break-even point by 2.5%.
Keywords: acoustic word embeddings; isolated word recognition; wav2vec pre-training; query by example; speech segment retrieval
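The fusion experiment described above concatenates frame-level wav2vec features with MFCCs before the BLSTM embedding network. A hedged sketch follows, using a wav2vec 2.0 checkpoint from transformers as a stand-in for the wav2vec model in the paper; the frame-rate alignment by interpolation and all dimensions here are assumptions.

```python
import torch
import torch.nn.functional as F
import torchaudio
from transformers import Wav2Vec2Model

mfcc_tf = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=13)
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

def fused_features(waveform):  # waveform: (1, samples) at 16 kHz
    mfcc = mfcc_tf(waveform).squeeze(0).T                      # (frames_m, 13)
    with torch.no_grad():
        w2v_feat = w2v(waveform).last_hidden_state.squeeze(0)  # (frames_w, 768)
    # Resample MFCC frames to the wav2vec frame rate, then concatenate
    # along the feature dimension so both views feed one BLSTM.
    mfcc = F.interpolate(
        mfcc.T.unsqueeze(0), size=w2v_feat.size(0), mode="linear"
    ).squeeze(0).T
    return torch.cat([w2v_feat, mfcc], dim=-1)                 # (frames_w, 781)
```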
5. Self-Diffuser: A Study of Speech-Driven Facial Expression Techniques
Authors: 臧梦利, 王少波, 智宇, 陈昂. 《计算机科学与应用》 (Computer Science and Application), 2024, Issue 8, pp. 236-249 (14 pages)
Previous research on speech-driven facial expression animation has achieved realistic and accurate lip movements and facial expressions from audio signals. Traditional methods focused mainly on learning deterministic mappings from speech to animation; recent studies have begun to explore the diversity of speech-driven 3D facial animation, capturing the complex many-to-many relationships between audio and facial motion by leveraging the diversity of diffusion models. The Self-Diffuser method proposed in this study uses the pre-trained speech representation model wav2vec 2.0 to encode audio input, and combines diffusion-based techniques with a Transformer to perform the generation task. This work not only overcomes the limitations of traditional regression models in generating lip movements that are both realistic and lip-reading comprehensible, but also explores the trade-off between precise lip synchronization and the creation of facial expressions independent of speech. Comparison and analysis against current state-of-the-art methods show that Self-Diffuser produces more accurate lip movements in speech-driven facial animation, as well as upper-face motions, loosely correlated with speech, that more closely resemble real speaking expressions; the introduced diffusion mechanism also greatly improves the diversity of the generated 3D facial animation sequences.
Keywords: wav2vec 2.0; Transformer; diffusion mechanism; speech-driven; facial animation
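A highly simplified sketch of the idea behind this abstract: a Transformer denoiser, conditioned on wav2vec 2.0 audio features, is trained with the standard DDPM noise-prediction objective. The network shape, motion dimensionality, and noise schedule are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

T = 1000  # diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class Denoiser(nn.Module):
    def __init__(self, motion_dim=512, audio_dim=768, d_model=256):
        super().__init__()
        self.in_proj = nn.Linear(motion_dim + audio_dim + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, audio_feat, t):
        # Condition each motion frame on its audio feature and the timestep.
        t_emb = t.float().view(-1, 1, 1).expand(-1, noisy_motion.size(1), 1) / T
        x = torch.cat([noisy_motion, audio_feat, t_emb], dim=-1)
        return self.out(self.encoder(self.in_proj(x)))

def training_step(model, motion, audio_feat):
    # Standard DDPM objective: corrupt the motion, predict the injected noise.
    t = torch.randint(0, T, (motion.size(0),))
    eps = torch.randn_like(motion)
    ab = alphas_bar[t].view(-1, 1, 1)
    noisy = ab.sqrt() * motion + (1 - ab).sqrt() * eps
    return nn.functional.mse_loss(model(noisy, audio_feat, t), eps)
```

Sampling repeatedly denoises from pure noise under the same audio conditioning, which is what gives the many-to-many diversity the abstract emphasizes.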
6. Self-supervised Learning for Speech Emotion Recognition Task Using Audio-visual Features and Distil Hubert Model on BAVED and RAVDESS Databases
Authors: Karim Dabbabi, Abdelkarim Mars. 《Journal of Systems Science and Systems Engineering》, SCIE / EI / CSCD, 2024, Issue 5, pp. 576-606 (31 pages)
Existing pre-trained models like DistilHuBERT excel at uncovering hidden patterns and facilitating accurate recognition across diverse data types, such as audio and visual information. We harnessed this capability to develop a deep learning model that uses DistilHuBERT to jointly learn these combined features for speech emotion recognition (SER). Our experiments highlight its distinct advantages: it significantly outperforms Wav2vec 2.0 in both offline and real-time accuracy on the RAVDESS and BAVED datasets. Although it slightly trails HuBERT's offline accuracy, DistilHuBERT delivers comparable performance at a fraction of the model size, making it an ideal choice for resource-constrained environments like mobile devices. The smaller size does come with a slight trade-off: DistilHuBERT achieved 96.33% offline accuracy on the BAVED database and 87.01% on the RAVDESS database, while in real-time evaluation accuracy decreased to 79.3% on BAVED and 77.87% on RAVDESS. This decrease likely results from the challenges of real-time processing, including latency and noise, but the model still performs strongly in practical scenarios. DistilHuBERT therefore emerges as a compelling choice for SER, especially when accuracy is prioritized over real-time processing, and its compact size further enhances its potential in resource-limited settings across a wide range of applications.
Keywords: wav2vec 2.0; DistilHuBERT; HuBERT; SER; audio and audio-visual features
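A hedged sketch of a DistilHuBERT-based SER classifier in the spirit of this paper; ntu-spml/distilhubert is the public distilled HuBERT checkpoint on the Hugging Face hub, while the mean-pooling head and class count are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class DistilHubertSER(nn.Module):
    def __init__(self, n_emotions=8):
        super().__init__()
        # Distilled HuBERT backbone: a fraction of HuBERT's size, which is
        # what makes it attractive for mobile / real-time deployment.
        self.backbone = AutoModel.from_pretrained("ntu-spml/distilhubert")
        self.head = nn.Linear(self.backbone.config.hidden_size, n_emotions)

    def forward(self, waveform):  # waveform: (batch, samples), 16 kHz mono
        hidden = self.backbone(waveform).last_hidden_state  # (batch, frames, dim)
        return self.head(hidden.mean(dim=1))  # mean-pool over time, classify

model = DistilHubertSER()
logits = model(torch.randn(2, 16000))
```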