Abstract
Traditional algorithms for the image description task focus mainly on the visual objects in an image and overlook the fact that text appearing in the image also plays an indispensable role in describing it. To improve the extraction of text information from images, this paper proposes an image description algorithm based on multimodal feature fusion. In addition to visual feature extraction, text detection and recognition algorithms are added, and a multimodal Transformer is used to fuse the two modalities. In the decoding stage, a central graph serves as the guiding module and a dynamic pointer network performs iterative decoding, so that the model generates richer natural-language descriptions. Experimental results on the TextCaps dataset show that the proposed method effectively improves the extraction accuracy of OCR tokens in text regions.
Source
《工业控制计算机》
2023, Issue 1, pp. 87-88, 91 (3 pages in total)
Industrial Control Computer