Abstract
To address the low prediction accuracy of classifiers on minority-class samples in unbalanced text classification, an unbalanced text classification algorithm based on an improved Focal Loss function and the EDA (Easy Data Augmentation) text augmentation technique is proposed. First, at the training-data level, EDA is used to augment the small-sample data. Second, to account for the dynamic change in how difficult samples are to train on, the method for setting the balance factor parameter of the Focal Loss function is improved. The augmented data and the improved loss function are then used to train the classification model with DCNN, a relatively simple model that retains word-order information. Comparative experiments on the Sogou Labs news dataset, run with the same parameters, show that EDA and the improved Focal Loss each improve results on the imbalance problem to some extent, and that the algorithm combining the two techniques performs best.
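As a rough illustration of the two building blocks named in the abstract, the sketch below shows the standard (unimproved) Focal Loss for a single sample and two of the four EDA operations, random swap and random deletion. The abstract does not specify how the balance factor is adapted, so `alpha` is simply a fixed per-class weight here. The function names, the default `gamma=2.0`, and the token lists are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def focal_loss(probs, target, gamma=2.0, alpha=None):
    """Standard focal loss for one sample; NOT the paper's improved variant.

    probs  : list of predicted class probabilities
    target : index of the true class
    gamma  : focusing parameter; gamma = 0 reduces to (weighted) cross-entropy
    alpha  : optional list of fixed per-class balance factors (assumed here)
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # clamp to avoid log(0)
    weight = 1.0 if alpha is None else alpha[target]
    # (1 - p_t)^gamma down-weights samples the model already classifies well
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


def random_swap(tokens, n_swaps, rng):
    """EDA random swap: exchange two randomly chosen positions, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """EDA random deletion: drop each token with probability p, keep at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the augmentation is repeatable
    sentence = ["stock", "market", "closes", "higher", "today"]
    print(random_swap(sentence, 1, rng))
    print(random_deletion(sentence, 0.2, rng))
    # An easy sample (p_t = 0.9) contributes far less loss than a hard one (p_t = 0.1)
    print(focal_loss([0.9, 0.1], 0), focal_loss([0.1, 0.9], 0))
```

The other two EDA operations, synonym replacement and random insertion, need a thesaurus for the target language and are omitted from this sketch.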
Authors
WANG Wen-hui (王雯慧)
JIN Da-wei (靳大尉)
Command and Control Engineering College, Army Engineering University of PLA, Nanjing, Jiangsu 210000, China
Source
Computer Simulation (《计算机仿真》)
Peking University Core Journal (北大核心)
2023, Issue 4, pp. 346-349, 396 (5 pages in total)
Funding
National Natural Science Foundation of China (61806221).
Keywords
Unbalanced text (UT)
Data augmentation (text augmentation)
Cost-sensitiveness
Classification algorithm