Abstract
Effective representation of visual features is key to improving continuous sign language recognition performance. However, variation in the temporal length of sign language actions, together with the weak annotation of sign language data, makes effective visual feature extraction more difficult. To address these problems, a method named multi-scale visual feature extraction and cross-modality alignment for continuous sign language recognition (MECA) is proposed. The method consists mainly of a multi-scale visual feature extraction module and a cross-modal alignment constraint. In the multi-scale visual feature extraction module, bottleneck residual structures with different dilation factors are fused in parallel to enrich the multi-scale temporal receptive field, so that visual features of sign language actions with different temporal lengths can be extracted. A hierarchical reuse design is also adopted to further strengthen the visual feature representation. In the cross-modal alignment constraint, dynamic time warping is used to model the intrinsic relationship between sign language visual features and textual features, where textual feature extraction is carried out jointly by a multilayer perceptron and a long short-term memory network. Experiments on the challenging public datasets RWTH-2014, RWTH-2014T and CSL-Daily show that the proposed method achieves competitive performance. The results indicate that the multi-scale approach captures sign language actions of different temporal lengths, and that the cross-modal alignment constraint is effective for continuous sign language recognition under weak supervision.
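The abstract names dynamic time warping (DTW) as the tool for relating visual features to textual features but does not give its formulation. The following is a minimal sketch of classic DTW only, not the authors' constraint: it assumes both sequences have already been projected to the same dimension, uses Euclidean distance as the local cost, and the name `dtw_align` is hypothetical. It is also not differentiable, so a training constraint would presumably need a different form than the one shown here.

```python
import math


def dtw_align(visual, text):
    """Classic DTW between two sequences of equal-dimension vectors.

    Illustrative assumption: Euclidean local cost, no step weights.
    Returns (total_cost, path), where path lists (visual_idx, text_idx).
    """
    n, m = len(visual), len(text)
    # acc[i][j] = minimal cost of aligning visual[:i] with text[:j]
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(visual[i - 1], text[j - 1])
            acc[i][j] = d + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])

    # Backtrack from the end to recover which frame matches which token.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return acc[n][m], path


# Toy data: four "frame" features and two "gloss" features (1-D for clarity).
frames = [[0.0], [0.1], [1.0], [1.1]]
glosses = [[0.0], [1.0]]
cost, path = dtw_align(frames, glosses)
print(cost, path)
```

On this toy input the first two frames align to the first gloss and the last two frames to the second, which is the kind of many-to-one frame-to-gloss correspondence that weakly annotated sign language video requires.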
Authors
GUO Leming (郭乐铭)
XUE Wanli (薛万利)
YUAN Tiantian (袁甜甜)
School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China; Technical College for the Deaf, Tianjin University of Technology, Tianjin 300384, China
Source
Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》)
CSCD
Peking University Core Journals (北大核心)
2024, No. 10, pp. 2762-2769 (8 pages)
Funding
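National Natural Science Foundation of China (62376197, 62020106004, 92048301)
Tianjin Research Innovation Project for Postgraduate Students (2021YJSB244)
Tianjin Science and Technology Program (23JCYBJC00360)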
Keywords
continuous sign language recognition
multi-scale
cross-modal alignment constraints
video visual features
text features