Multi-scale Visual Feature Extraction and Cross-Modality Alignment for Continuous Sign Language Recognition

Abstract: In continuous sign language recognition, an effective representation of visual features is key to improving recognition performance. However, sign language actions vary in temporal length and the training data is only weakly annotated, which makes effective visual feature extraction more difficult. To address these problems, a method named multi-scale visual feature extraction and cross-modality alignment for continuous sign language recognition (MECA) is proposed. The method mainly consists of a multi-scale visual feature extraction module and a cross-modality alignment constraint. In the multi-scale visual feature extraction module, bottleneck residual structures with different dilation factors are fused in parallel to enrich the multi-scale temporal receptive field, so that visual features of sign language actions with different temporal lengths can be extracted. A hierarchical reuse design is also adopted to further strengthen the visual feature representation. In the cross-modality alignment constraint, dynamic time warping is used to model the intrinsic relationship between sign language visual features and textual features, where the textual features are extracted jointly by a multilayer perceptron and a long short-term memory network. Experiments on the challenging public datasets RWTH-2014, RWTH-2014T and CSL-Daily show that the proposed method achieves competitive performance. The results indicate that the multi-scale approach can capture sign language actions of different temporal lengths, and that the cross-modality alignment constraint is sound and effective for continuous sign language recognition under weak supervision.
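To make the alignment idea in the abstract concrete, the following is a minimal sketch of classic dynamic time warping between a long sequence of visual feature vectors and a shorter sequence of text feature vectors. The abstract does not state which distance measure is used or how the warping cost enters the training objective, so the Euclidean distance, the function names `euclidean` and `dtw`, and the toy one-dimensional features are all assumptions made for illustration, not the authors' implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(visual, text, dist=euclidean):
    """Classic DTW: return (total_cost, path) aligning two feature sequences.

    `visual` and `text` are lists of feature vectors. The path is a list of
    (visual_index, text_index) pairs that is monotonic in both indices.
    """
    n, m = len(visual), len(text)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    # cost[i][j] = minimal accumulated cost of aligning the first i visual
    # steps with the first j text steps (index 0 is the empty prefix).
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(visual[i - 1], text[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both
                                 cost[i - 1][j],      # advance visual only
                                 cost[i][j - 1])      # advance text only

    # Backtrack from the end to recover one optimal warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


if __name__ == "__main__":
    # Toy data (hypothetical): five video steps and three gloss embeddings.
    frames = [[0.0], [0.1], [1.0], [1.1], [2.0]]
    glosses = [[0.0], [1.0], [2.0]]
    total, path = dtw(frames, glosses)
    print(round(total, 6), path)
```

In this toy case several consecutive visual steps map to the same text step, which is the behaviour that suits weak supervision: the gloss sequence is known, but the frame-level boundaries are not. A trainable constraint would normally need a differentiable variant of this cost, which the sketch does not attempt.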

Authors: GUO Leming (郭乐铭); XUE Wanli (薛万利); YUAN Tiantian (袁甜甜) (School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China; Technical College for the Deaf, Tianjin University of Technology, Tianjin 300384, China)

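Affiliations: School of Computer Science and Engineering, Tianjin University of Technology; Technical College for the Deaf, Tianjin University of Technology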

Source: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), indexed in CSCD and the Peking University Core Journals list, 2024, No. 10, pp. 2762-2769 (8 pages)

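Funding: National Natural Science Foundation of China (62376197, 62020106004, 92048301); Tianjin Postgraduate Research and Innovation Project (2021YJSB244); Tianjin Science and Technology Plan Project (23JCYBJC00360).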

Keywords: continuous sign language recognition; multi-scale; cross-modal alignment constraints; video visual features; text features

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
