Abstract
To address the shortage of samples in the automatic classification of civil aviation safety information, an automatic classifier is built from a bidirectional encoder representations from transformers (BERT) pre-trained model, the easy data augmentation (EDA) algorithm, and a support vector machine (SVM). Unsafe-event records from the China Civil Aviation Safety Information System are divided into subsets by the order of magnitude of the number of records per event type, and the performance of the model on each subset is analyzed, with particular attention to the gain on small data sets. The results show that the weighted F1 score (F1w) improves by 31.21% when each event type has on the order of tens of records, by 9.66% at the order of hundreds, and by 3.35% at the order of thousands. With text augmentation, a classifier trained on a relatively small sample set performs well and can be applied to the automatic classification of civil aviation safety information.
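As a rough illustration of the text augmentation step, the sketch below implements two of the standard EDA operations, random swap and random deletion, in plain Python. It is not the authors' implementation: the function names, the `alpha` and `n_aug` parameters, the whitespace tokenization, and the sample sentence are all assumptions made for this example, and the synonym-based EDA operations are omitted because they need a thesaurus.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with two random positions swapped, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one token."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(tokens, n_aug=4, alpha=0.1, seed=0):
    """Produce n_aug variants, alternating random swap and random deletion.

    alpha (hypothetical name) sets both the share of tokens swapped and the
    per-token deletion probability.
    """
    rng = random.Random(seed)
    n_swaps = max(1, int(alpha * len(tokens)))
    variants = []
    for k in range(n_aug):
        if k % 2 == 0:
            variants.append(random_swap(tokens, n_swaps, rng))
        else:
            variants.append(random_deletion(tokens, alpha, rng))
    return variants


if __name__ == "__main__":
    # Made-up example sentence, not taken from the safety information system.
    report = "aircraft encountered bird strike during final approach".split()
    for variant in augment(report):
        print(" ".join(variant))
```

Each variant keeps the label of the record it was derived from, so an event type with only tens of records can be expanded several times over before the classifier is trained.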
Authors
CUI Zhenxin (崔振新)
ZHANG Zhuoyan (张卓言)
College of Flight Technology, CAUC, Tianjin 300300, China
Source
Journal of Civil Aviation University of China (《中国民航大学学报》)
CAS
2022, No. 3, pp. 47-53, 64 (8 pages in total)
Funding
Civil Aviation Safety Capability Building Fund Project (ASSA2020/12).
Keywords
civil aviation safety
safety information
text augmentation
natural language processing