
Error Correction After Multi-task Lao Character Recognition Fusing Glyph Features (original title: 融合字形特征的多任务老挝语文字识别后纠错)
Abstract: Post-processing is an important step for detecting and correcting errors in text produced by text recognition. Lao text recognition output contains many substitution errors between visually similar characters, as well as character insertion and deletion errors caused by broken and touching characters. To address these problems, this paper proposes a multi-task post-recognition error correction method for Lao that fuses character glyph features. The method adopts a seq2seq architecture based on long short-term memory (LSTM) networks and integrates Lao glyph features into the model to help it correct similar-character substitution errors. To handle insertion and deletion errors, the encoder is combined with a multi-scale convolutional network that extracts local features of the text using different convolution kernel sizes. A language model then re-ranks the text sequence predicted by the decoder together with the original text to select the best candidate. Multi-task learning is also used, with error detection as an auxiliary task, to improve correction performance, and the dataset is expanded through data augmentation. Experimental results show that the method reduces the character error rate of Lao text recognition to 7.94%.
Authors: YANG Zhi Chuo-qi (杨志婥琪); ZHOU Lan-jiang (周兰江); ZHOU Lei-yue (周蕾越). Affiliations: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Faculty of Electronics and Information Engineering, Oxbridge College, Kunming University of Science and Technology, Kunming 650106, China.
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2023, Issue 3, pp. 506-513 (8 pages).
Funding: Supported by the National Natural Science Foundation of China (61662040).
Keywords: Lao text recognition post-processing; Seq2seq; multi-task learning; glyph features
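
The abstract reports its result as a character error rate (CER). As a minimal sketch of how that metric is commonly computed, the Python below takes the Levenshtein edit distance (substitutions, insertions, and deletions) between a reference string and a recognized string and divides it by the reference length. The function names `edit_distance` and `character_error_rate` are hypothetical, and counting Unicode code points as "characters" is an assumption: the record does not say whether the paper counts code points or Lao grapheme clusters.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between ref and hyp, counted over code points."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string is i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Illustrative strings only; they are not taken from the paper's data.
    print(character_error_rate("abcd", "abxd"))
```

Under this definition the same function covers all three error types named in the abstract, since a similar-character substitution, a spurious inserted character, and a dropped character each add one to the distance.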
