摘要
近年来,预训练模型在自然语言处理领域蓬勃发展,旨在对自然语言隐含的知识进行建模和表示,但主流预训练模型大多针对英文领域。中文领域起步相对较晚,鉴于其在自然语言处理过程中的重要性,学术界和工业界都开展了广泛的研究,提出了众多的中文预训练模型。文中对中文预训练模型的相关研究成果进行了较为全面的回顾,首先介绍预训练模型的基本概况及其发展历史,对中文预训练模型主要使用的两种经典模型Transformer和BERT进行了梳理,然后根据不同模型所属类别提出了中文预训练模型的分类方法,并总结了中文领域的不同评测基准,最后对中文预训练模型未来的发展趋势进行了展望。旨在帮助科研工作者更全面地了解中文预训练模型的发展历程,继而为新模型的提出提供思路。
In recent years,pre-training models have flourished in the field of natural language processing,aiming at modeling and representing the implicit knowledge of natural language.However,most of the mainstream pre-training models target at the English domain,and the Chinese domain starts relatively late.Given its importance in the natural language processing process,extensive research has been conducted in both academia and industry,and numerous Chinese pre-training models have been proposed.This paper presents a comprehensive review of the research results related to Chinese pre-training models,firstly introducing the basic overview of pre-training models and their development history,then sorting out the two classical models Transformer and BERT that are mainly used in Chinese pre-training models,then proposing a classification method for Chinese pre-training models according to model categories,and summarizes the different evaluation benchmarks in the Chinese domain.Finally,the future development trend of Chinese pre-training models is prospected.It aims to help researchers to gain a more comprehensive understanding of the development of Chinese pre-training models,and then to provide some ideas for the proposal of new models.
作者
侯钰涛
阿布都克力木·阿布力孜
哈里旦木·阿布都克里木
HOU Yu-tao;ABULIZI Abudukelimu;ABUDUKELIMU Halidanmu(School of Information Management,Xinjiang University of Finance and Economics,Urumqi 830012,China)
出处
《计算机科学》
CSCD
北大核心
2022年第7期148-163,共16页
Computer Science
基金
国家自然科学基金(61866035,61966033)。
关键词
中文预训练模型
自然语言处理
词向量
预处理
深度学习
Chinese pre-training models
Natural language processing
Word embedding
Pre-training
Deep learning