Abstract
Knowledge distillation is a model compression technique commonly used to address the large size and slow inference of deep pre-trained models such as BERT. Multi-teacher distillation can further improve the performance of the student model, but the traditional "one-to-one" forced assignment of teacher intermediate layers discards most of the intermediate features. A "single-layer-to-multi-layer" mapping scheme is proposed to resolve the layer-alignment problem in knowledge distillation and to help the student model acquire the syntactic, coreferential, and other knowledge contained in the teachers' intermediate layers. Experiments on several GLUE datasets show that the student model retains 93.9% of the teachers' average inference accuracy while using only 41.5% of the teachers' average parameter size.
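A minimal sketch of the general idea of mapping one student layer to several teacher intermediate layers with soft weights, so that no teacher layer is discarded outright. This is an illustration in PyTorch under assumed tensor shapes and the hypothetical helper name one_to_many_layer_loss; it is not the paper's exact formulation.

import torch
import torch.nn.functional as F

def one_to_many_layer_loss(student_h, teacher_hs):
    # student_h: (batch, seq, dim) hidden state of one student layer.
    # teacher_hs: list of (batch, seq, dim) hidden states drawn from several
    # teacher layers (possibly from more than one teacher model).
    # Similarity between the student layer and each candidate teacher layer,
    # turned into soft weights over the mapped teacher layers.
    sims = torch.stack([
        F.cosine_similarity(student_h.flatten(1), t.flatten(1), dim=1).mean()
        for t in teacher_hs
    ])                                   # (num_teacher_layers,)
    weights = F.softmax(sims, dim=0)     # soft weights, no layer hard-dropped
    # Weighted feature-matching loss against every mapped teacher layer.
    losses = torch.stack([F.mse_loss(student_h, t) for t in teacher_hs])
    return (weights * losses).sum()

# Toy usage with random tensors standing in for BERT hidden states.
if __name__ == "__main__":
    B, S, D = 2, 16, 768
    student_layer = torch.randn(B, S, D)
    teacher_layers = [torch.randn(B, S, D) for _ in range(4)]
    print(one_to_many_layer_loss(student_layer, teacher_layers).item())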
Authors
SHI Jialai, GUO Weibin (School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China)
Source
《大数据》 (Big Data Research), 2024, No. 3, pp. 119-132 (14 pages)
Funding
National Natural Science Foundation of China (No. 62076094).
Keywords
deep pre-trained model
BERT
multi-teacher distillation
natural language understanding