Joint Inference of Visual Attention and Semantic Perception for Scene Text Recognition
Abstract: Recognizing irregular text in natural scenes remains a challenging problem. To handle arbitrarily shaped and low-quality scene text, this paper proposes a multimodal network that combines a visual attention module with a semantic perception module. The visual attention module uses parallel attention together with position-aware encoding to extract visual features from the image. The semantic perception module, trained with weak supervision, learns linguistic information to compensate for deficiencies in the visual features; it is a Transformer variant trained by randomly masking one character in a word, which improves the model's contextual semantic reasoning. A visual-semantic fusion module uses a gating mechanism to let information from the two modalities interact and produce robust features for character prediction. Extensive experiments show that the proposed method recognizes arbitrarily shaped and low-quality scene text effectively and achieves competitive results on several benchmark datasets. In particular, on SVT and SVTP, which contain low-quality text, recognition accuracy reaches 93.6% and 86.2%, respectively. Compared with a model that uses only the visual module, accuracy improves by 3.5% and 3.9%, respectively, which demonstrates the importance of semantic information for text recognition.
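The two mechanisms named in the abstract, single-character masking and gated fusion, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `[MASK]` token, and the scalar gate weights `w_visual`, `w_semantic`, and `bias` are assumptions made for illustration. A trained model would most likely compute the gate from a learned projection of the visual and semantic features rather than from fixed scalars.

```python
import math
import random


def mask_one_character(word, rng, mask_token="[MASK]"):
    """Replace one randomly chosen character of `word` with `mask_token`.

    Returns the masked character list, the masked position, and the
    original character that a language model would be trained to recover.
    The token name "[MASK]" is a hypothetical choice.
    """
    if not word:
        raise ValueError("word must be non-empty")
    position = rng.randrange(len(word))
    masked = list(word)
    target = masked[position]
    masked[position] = mask_token
    return masked, position, target


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, semantic, w_visual, w_semantic, bias):
    """Element-wise gated fusion of two equal-length feature vectors.

    gate  = sigmoid(w_visual * v + w_semantic * s + bias)
    fused = gate * v + (1 - gate) * s

    The scalar weights are a simplification (assumption); they stand in
    for parameters that would normally be learned.
    """
    if len(visual) != len(semantic):
        raise ValueError("feature vectors must have the same length")
    fused = []
    for v, s in zip(visual, semantic):
        gate = sigmoid(w_visual * v + w_semantic * s + bias)
        fused.append(gate * v + (1.0 - gate) * s)
    return fused


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    masked, position, target = mask_one_character("coffee", rng)
    print(masked, position, target)

    # Toy visual and semantic features for a single character slot.
    print(gated_fusion([0.9, 0.1], [0.2, 0.8], 1.0, -1.0, 0.0))
```

Because the gate lies strictly between 0 and 1, each fused value is a convex combination of the corresponding visual and semantic values, so the sketch can lean on whichever modality the gate favors.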
Authors: TONG Guoxiang (佟国香); DONG Tianrong (董田荣); HU Hengzhang (胡珩彰) (College of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)
Source: Journal of Data Acquisition and Processing (《数据采集与处理》), indexed in CSCD and the Peking University Core Journals list; 2023, No. 3, pp. 665-675 (11 pages)
Funding: National Key Research and Development Program of China (2018YFB1700902).
Keywords: scene text recognition; irregular text; visual attention module; semantic perception module; multimodal
