Self-supervised Learning for Speech Emotion Recognition Task Using Audio-visual Features and Distil Hubert Model on BAVED and RAVDESS Databases

Abstract Pre-trained models such as Distil HuBERT are effective at uncovering hidden patterns and supporting accurate recognition across diverse data types, including audio and visual information. We use this capability to build a deep learning model in which Distil HuBERT jointly learns combined audio-visual features for speech emotion recognition (SER). In our experiments on the RAVDESS and BAVED datasets, the model significantly outperforms Wav2vec 2.0 in both offline and real-time accuracy. Although it slightly trails HuBERT in offline accuracy, Distil HuBERT delivers comparable performance at a fraction of the model size, which makes it well suited to resource-constrained environments such as mobile devices. In offline evaluation, Distil HuBERT reached an accuracy of 96.33% on the BAVED database and 87.01% on the RAVDESS database. In real-time evaluation, accuracy fell to 79.3% on BAVED and 77.87% on RAVDESS. This decrease likely reflects the challenges of real-time processing, including latency and noise, yet the results still indicate strong performance in practical scenarios. Distil HuBERT is therefore a compelling choice for SER, especially when accuracy is prioritized over real-time processing, and its compact size makes it a versatile option for resource-limited applications.

Authors Karim Dabbabi, Abdelkarim Mars

Affiliations Research Unite of Analyse and Processing of Electrical and Energetic Systems; Research Laboratory in Algebra

Source Journal of Systems Science and Systems Engineering (系统科学与系统工程学报（英文版）), indexed in SCIE, EI, CSCD; 2024, Issue 5, pp. 576-606 (31 pages)

Keywords Wav2vec 2.0; Distil HuBERT; HuBERT; SER; audio and audio-visual features

Classification code TP3 [Automation and Computer Technology: Computer Science and Technology]
