Abstract
Lao is a low-resource language, and the scarcity of text corpora makes it difficult to carry out basic natural language processing tasks for Lao. Research on Lao optical character recognition (OCR) can alleviate this scarcity to some extent. This paper proposes a Lao text recognition method that fuses text features such as Lao lexical features and character vectors. First, the method uses a convolutional neural network with a residual structure as the backbone and adds a convolutional attention module to extract image features from pictures of Lao text. Second, an attention mechanism dynamically assigns weights to combine the image features with word vectors and character vectors pre-trained with GloVe. Third, a bidirectional long short-term memory (BiLSTM) network encodes the combined features to predict the true distribution of Lao text sequence labels; at the same time, Lao syllable composition rules are incorporated, and a branch that predicts syllable-rule labels is used to optimize the recognition model. Finally, connectionist temporal classification (CTC) is used to perform sequence alignment on the label distribution. Experimental results show that the method achieves good Lao text recognition performance, with an accuracy of 88.63%.
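The final alignment step, connectionist temporal classification (CTC), can be illustrated at inference time with best-path (greedy) decoding: take the top label per frame, merge consecutive repeats, then drop blanks. The following is a minimal sketch and not the authors' implementation; the function name `ctc_greedy_decode`, the blank index 0, and the toy scores are assumptions made for illustration only.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      blank: int = BLANK) -> List[int]:
    """Best-path CTC decoding over per-frame label scores."""
    # Pick the highest-scoring label index at every time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]

    decoded: List[int] = []
    previous = None
    for label in best_path:
        # Keep a label only if it differs from the previous frame
        # (merging repeats) and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Toy scores for 7 frames over 3 labels (0 = blank); values are made up.
toy_scores = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.10, 0.20, 0.70],
    [0.10, 0.20, 0.70],
    [0.90, 0.05, 0.05],
]
result = ctc_greedy_decode(toy_scores)
```

The blank frame between the two runs of label 1 is what lets the decoder emit the same label twice in a row, which matters for scripts where a character can repeat.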
Authors
杨志婥琪
周兰江
周蕾越
YANG Zhi Chuo-qi; ZHOU Lan-jiang; ZHOU Lei-yue (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Faculty of Electronics and Information Engineering, Oxbridge College, Kunming University of Science and Technology, Kunming 650106, China)
Source
《小型微型计算机系统》
CSCD
Peking University Core Journals (北大核心)
2022, Issue 4, pp. 723-730 (8 pages)
Journal of Chinese Computer Systems
Funding
Supported by the National Natural Science Foundation of China (61662040).
Keywords
Lao text recognition
lexical features
feature fusion
attention mechanism
residual structure