The performance of deep learning models is heavily reliant on the quality and quantity of train-ing data.Insufficient training data will lead to overfitting.However,in the task of alert-situation text classification,i...The performance of deep learning models is heavily reliant on the quality and quantity of train-ing data.Insufficient training data will lead to overfitting.However,in the task of alert-situation text classification,it is usually difficult to obtain a large amount of training data.This paper proposes a text data augmentation method based on masked language model(MLM),aiming to enhance the generalization capability of deep learning models by expanding the training data.The method em-ploys a Mask strategy to randomly conceal words in the text,effectively leveraging contextual infor-mation to predict and replace masked words based on MLM,thereby generating new training data.Three Mask strategies of character level,word level and N-gram are designed,and the performance of each Mask strategy under different Mask ratios is analyzed and studied.The experimental results show that the performance of the word-level Mask strategy is better than the traditional data augmen-tation method.展开更多
基金Supported by the Humanities and Social Sciences Research Project of the Ministry of Education(No.22YJA840004).
文摘The performance of deep learning models is heavily reliant on the quality and quantity of train-ing data.Insufficient training data will lead to overfitting.However,in the task of alert-situation text classification,it is usually difficult to obtain a large amount of training data.This paper proposes a text data augmentation method based on masked language model(MLM),aiming to enhance the generalization capability of deep learning models by expanding the training data.The method em-ploys a Mask strategy to randomly conceal words in the text,effectively leveraging contextual infor-mation to predict and replace masked words based on MLM,thereby generating new training data.Three Mask strategies of character level,word level and N-gram are designed,and the performance of each Mask strategy under different Mask ratios is analyzed and studied.The experimental results show that the performance of the word-level Mask strategy is better than the traditional data augmen-tation method.