
Scene Text Recognition Based on Feature Fusion in Space Domain and Frequency Domain
Abstract: Existing scene text recognition methods often suffer from low robustness and poor generalization in few-shot, language-independent scenarios. To address this problem, in the feature extraction stage, a dual-stream network that fuses space-domain and frequency-domain features is proposed. It consists of a deep residual convolutional branch that extracts space-domain features and a branch that extracts frequency-domain features with a one-dimensional fast Fourier transform (FFT) followed by a shallow neural network; the two kinds of features are then fused by a channel attention mechanism. In the sequence modeling stage, a multi-scale one-dimensional convolution module is proposed to replace the bidirectional long short-term memory (BiLSTM) network, in line with the characteristics of language-independent scenes. A complete model is then built by combining the existing TPS rectification module and a CTC decoder. Training follows a transfer learning scheme: the model is first pre-trained on large English datasets and then fine-tuned on the target datasets. Experimental results on two few-shot language-independent datasets compiled in this paper show that the proposed model outperforms existing models in accuracy, verifying its high robustness and generalization ability in this scenario. In addition, experiments on five benchmark datasets of language-dependent scenes (without fine-tuning) show that the method using the proposed feature extraction module outperforms the compared baselines, demonstrating the effectiveness and generality of the proposed dual-stream feature fusion network.
Authors: HUO Huaqi, LU Lu (School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China; Peng Cheng Laboratory, Shenzhen, Guangdong 518055, China)
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2023, No. S02, pp. 36-43 (8 pages)
Funding: Key Areas Research Program of Guangdong Province (2022B0101070001).
Keywords: Deep learning; Scene text recognition; Dual-stream network; Frequency domain branch; Few-shot
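
The paper's implementation is not reproduced on this page, so the following PyTorch sketch only illustrates the two components named in the abstract: a dual-stream feature extractor that fuses a convolutional space-domain branch with a 1-D FFT frequency-domain branch via channel attention, and a multi-scale one-dimensional convolution block used in place of a BiLSTM. All class names, channel counts, kernel sizes, the fixed input width, and the squeeze-and-excitation form of the attention are assumptions for illustration, not the authors' code.

import torch
import torch.nn as nn


class DualStreamFusion(nn.Module):
    """Illustrative dual-stream feature extractor: a CNN branch for space-domain
    features, a 1-D FFT + shallow projection branch for frequency-domain features,
    fused with squeeze-and-excitation style channel attention (all sizes assumed)."""

    def __init__(self, in_ch=1, spat_ch=256, freq_ch=64, width=100):
        super().__init__()
        # Space-domain branch: small stand-in for the paper's deep residual CNN.
        self.spatial = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, spat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, width)),      # collapse height, keep a fixed width
        )
        # Frequency-domain branch: 1-D FFT magnitude along the width, then a shallow net.
        self.freq_proj = nn.Sequential(nn.Linear(width // 2 + 1, width), nn.ReLU())
        self.freq_ch = nn.Conv1d(in_ch, freq_ch, 1)
        # Channel attention over the concatenated feature channels.
        fused = spat_ch + freq_ch
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(fused, fused // 8, 1), nn.ReLU(),
            nn.Conv1d(fused // 8, fused, 1), nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (B, in_ch, H, width) text image
        spat = self.spatial(x).squeeze(2)          # (B, spat_ch, width)
        spec = torch.fft.rfft(x, dim=-1).abs().mean(dim=2)   # (B, in_ch, width//2 + 1)
        freq = self.freq_ch(self.freq_proj(spec))  # (B, freq_ch, width)
        fused = torch.cat([spat, freq], dim=1)     # (B, spat_ch + freq_ch, width)
        return fused * self.attn(fused)            # channel-wise re-weighting


class MultiScaleConv1d(nn.Module):
    """Multi-scale 1-D convolutions as a drop-in for BiLSTM sequence modeling:
    parallel kernels capture context at several receptive fields along the width."""

    def __init__(self, ch, kernels=(1, 3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(ch, ch, k, padding=k // 2) for k in kernels])

    def forward(self, seq):                        # seq: (B, ch, width)
        return torch.relu(sum(b(seq) for b in self.branches))


if __name__ == "__main__":
    img = torch.randn(2, 1, 32, 100)               # batch of grayscale text-line crops
    feats = DualStreamFusion()(img)                # (2, 320, 100)
    out = MultiScaleConv1d(feats.shape[1])(feats)  # (2, 320, 100)
    print(feats.shape, out.shape)

In the full pipeline described in the abstract, such fused features would be produced after a TPS rectification step and fed, after sequence modeling, to a CTC decoder; those parts are omitted here.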