Abstract
The BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model strips the tones from Vietnamese syllables when tokenizing Vietnamese text, so a grammatical error detection model loses part of the semantic information during training. To address this problem, a method that fuses Vietnamese part-of-speech and tone features is proposed to restore the semantic information of the input syllables. In addition, because annotated Vietnamese corpora are scarce, the grammatical error detection task suffers from insufficient training data. To address this problem, a data augmentation algorithm is designed that generates large amounts of erroneous text from correct corpora. Experimental results on Vietnamese Wikipedia and news corpora show that the proposed method achieves the highest F-scores on the test set, demonstrating that it improves detection performance. Moreover, both the proposed method and the baseline models improve steadily as the scale of the generated data increases, which confirms the effectiveness of the proposed data augmentation algorithm.
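As background for the tone-feature fusion step: Vietnamese encodes its six tones with combining diacritics on the vowel, so a tone label can be recovered from each syllable before that information is discarded by tokenization. The sketch below is a minimal illustration of such tone extraction, not the paper's implementation; the function name extract_tone and the label strings are assumptions.

```python
import unicodedata

# Combining marks that encode the five marked Vietnamese tones;
# the sixth tone (ngang) is the absence of any tone mark.
# Circumflex, breve, and horn are vowel-quality marks, not tones,
# so they are deliberately excluded here.
TONE_MARKS = {
    "\u0300": "huyen",  # grave accent
    "\u0301": "sac",    # acute accent
    "\u0303": "nga",    # tilde
    "\u0309": "hoi",    # hook above
    "\u0323": "nang",   # dot below
}

def extract_tone(syllable: str) -> str:
    """Return the tone label of a Vietnamese syllable ('ngang' if unmarked)."""
    # NFD decomposition separates base letters from combining diacritics.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return "ngang"

print(extract_tone("tiếng"))  # sac
print(extract_tone("Việt"))   # nang
print(extract_tone("ta"))     # ngang
```

Such a tone label, together with a part-of-speech tag, could then be embedded and fused with the model's input representation, which is the general shape of the approach the abstract describes.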
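The abstract does not specify which corruption operations the data augmentation algorithm applies, so the following is only a hypothetical sketch of the general idea of turning correct sentences into erroneous training text; the corrupt function and its delete/duplicate/swap operation set are assumptions, not the paper's algorithm.

```python
import random

def corrupt(syllables, p=0.15):
    """Randomly delete, duplicate, or swap adjacent syllables
    to turn a correct sentence into an erroneous one."""
    out, i = [], 0
    while i < len(syllables):
        if random.random() < p:
            op = random.choice(["delete", "duplicate", "swap"])
            if op == "delete":
                i += 1                     # drop this syllable
                continue
            if op == "duplicate":
                out += [syllables[i]] * 2  # repeat this syllable
                i += 1
                continue
            if op == "swap" and i + 1 < len(syllables):
                out += [syllables[i + 1], syllables[i]]
                i += 2
                continue
        out.append(syllables[i])
        i += 1
    return out

random.seed(0)  # for a reproducible example
sentence = "tôi đang học tiếng Việt".split()
print(" ".join(corrupt(sentence, p=0.3)))
```

Token-level error labels for detection training could then be derived by aligning the corrupted output against the original sentence.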
Authors
ZHANG Zhou (张洲), ZHU Jun-guo (朱俊国), YU Zheng-tao (余正涛)
School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journals (北大核心)
2022, No. 11, pp. 221-227 (7 pages)
Funding
National Natural Science Foundation of China (62166022, 61732005, 61866020)
Yunnan Provincial Major Science and Technology Project (202002AD080001, 202103AA080015)
General Program of the Yunnan Provincial Science and Technology Department (202101AT070077)
Yunnan Provincial Talent Training Project (KKSY201903018)
Keywords
Pre-trained language model
Vietnamese grammatical error detection
Feature fusion
Data augmentation