Abstract
Objective: To support accurate automatic classification of large-scale Chinese patent collections by using pre-trained language models with improved text representations of Chinese patents. Methods: Starting from the Chinese pre-trained language model RoBERTa, the masked language model task was run on a large-scale corpus of Chinese invention patents for transfer learning, once with a single-character masking strategy and once with a whole word masking strategy. This yielded a RoBERTa model (ZL-RoBERTa) and a RoBERTa-wwm model (ZL-RoBERTa-wwm) with improved representations of Chinese patent text. Both models were applied to a patent text classification task and compared with a typical deep learning model (Word2Vec+BiGRU+ATT+TextCNN) and with the current state-of-the-art pre-trained language models BERT and RoBERTa. Results: The Chinese patent classification models based on ZL-RoBERTa and ZL-RoBERTa-wwm outperformed the compared models in precision, recall, and F1 on the patent text classification task. Conclusion: Chinese patent pre-trained language models with improved text representations are more effective for patent text classification, and they provide a model basis for later use of pre-trained models in patent intelligence work.
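The two masking strategies named in the Methods can be contrasted with a minimal sketch. It is not the authors' training code: the function names, the example sentence, its word segmentation, and the 15% masking ratio are all assumptions for illustration, and the usual random or unchanged token replacements of masked language modelling are left out. The only point is the unit of masking: one character at a time versus every character of a selected word.

```python
import random

MASK = "[MASK]"

def char_mask(words, ratio, rng):
    """Single-character masking: every character is an independent masking unit."""
    chars = [ch for w in words for ch in w]
    n = max(1, round(len(chars) * ratio))
    picked = set(rng.sample(range(len(chars)), n))
    return [MASK if i in picked else ch for i, ch in enumerate(chars)]

def whole_word_mask(words, ratio, rng):
    """Whole word masking: when a word is picked, all of its characters are masked."""
    n = max(1, round(len(words) * ratio))
    picked = set(rng.sample(range(len(words)), n))
    out = []
    for i, w in enumerate(words):
        out.extend([MASK] * len(w) if i in picked else list(w))
    return out

# Hypothetical, already segmented patent-style sentence (segmentation is assumed,
# in practice it would come from a Chinese word segmenter).
words = ["一种", "锂", "电池", "隔膜", "的", "制备", "方法"]
rng = random.Random(0)
print(char_mask(words, 0.15, rng))
print(whole_word_mask(words, 0.15, rng))
```

With single-character masking, a model may see half of a technical term and only has to guess the other half; with whole word masking it has to recover the full term from context.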
Authors
MA Jun (马俊); LV Lu-cheng (吕璐成); ZHAO Ya-juan (赵亚娟); LI Cong-ying (李聪颖)
Information Research Center of Military Sciences, Academy of Military Sciences, Beijing 100142, China; National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Source
Chinese Journal of Medical Library and Information Science (《中华医学图书情报杂志》)
CAS
2022, No. 11, pp. 20-28 (9 pages)
Keywords
Chinese patent
Text representation
Pre-trained language model
Text classification