Abstract
Unimodal methods for infant crying detection have struggled to improve their recognition accuracy, while the volume of infant-related video data keeps growing. Against this background, this paper proposes a bimodal audio-video fusion method, built on a CNN+3DCNN+LSTM network, to detect infant crying and further improve its recognition rate. The paper first builds an audio-video dataset for binary classification of infant crying versus non-crying in complex environments, and then designs seven comparison experiments on this dataset to compare against the CNN+3DCNN+LSTM audio-video fusion network. The experiments show that the fusion method achieves an F1-score of 93.2%, which is 5.3% higher than the best unimodal score and 4.3% higher than the multimodal network baseline. This demonstrates the feasibility of the audio-video fusion method for infant crying recognition.
Authors
刘朋
周娴玮
龚启旭
余松森
LIU Peng; ZHOU Xianwei; GONG Qixu; YU Songsen (School of Software, South China Normal University, Foshan 528225)
Source
《计算机与数字工程》
2023, No. 7, pp. 1534-1539 (6 pages)
Computer & Digital Engineering
Keywords
infant crying
audio-video fusion
deep learning
multimodal network