
Research on automatic classification of Chinese patents based on pre-trained language models (基于预训练语言模型的中文专利自动分类研究). Cited by: 1
Abstract: Objective: To support accurate automatic classification of large-scale Chinese patent collections, this paper uses pre-trained language models with improved Chinese patent text representations to classify patents automatically. Methods: Starting from the Chinese pre-trained language model RoBERTa, transfer learning was carried out on a large-scale corpus of Chinese invention patents using the masked language model task, once with a single-character masking strategy and once with a whole-word masking strategy. This yielded a RoBERTa model (ZL-RoBERTa) and a RoBERTa-wwm model (ZL-RoBERTa-wwm) with improved Chinese patent text representations. The models were applied to a patent text classification task and compared with typical deep learning models (Word2Vec+BiGRU+ATT+TextCNN) and with the current state-of-the-art pre-trained language models BERT and RoBERTa. Results: The Chinese patent classification models based on ZL-RoBERTa and ZL-RoBERTa-wwm performed better on the patent text classification task in terms of precision, recall, and F1. Conclusion: Chinese patent pre-trained language models with improved text representations are more effective for patent text classification, which provides a model basis for later applications of pre-trained models in patent intelligence work.
Authors: MA Jun (马俊); LV Lu-cheng (吕璐成); ZHAO Ya-juan (赵亚娟); LI Cong-ying (李聪颖). Affiliations: Information Research Center of Military Sciences, Academy of Military Sciences, Beijing 100142, China; National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Source: Chinese Journal of Medical Library and Information Science (《中华医学图书情报杂志》), CAS, 2022, No. 11, pp. 20-28 (9 pages)
Keywords: Chinese patent; text representation; pre-trained language model; text classification
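The two masking strategies named in the abstract can be contrasted with a minimal sketch. It is not the authors' implementation: the function names, the 15% masking ratio, the example phrase, and its word segmentation are all assumptions made for illustration. It also simplifies the usual masked language model recipe by always substituting `[MASK]`. Real pipelines obtain word boundaries from a Chinese word segmenter, whereas here they are given by hand.

```python
import random

MASK = "[MASK]"

def char_mask(words, ratio, rng):
    """Single-character masking: every character is an independent candidate."""
    chars = [c for w in words for c in w]
    n = max(1, round(len(chars) * ratio))
    picked = set(rng.sample(range(len(chars)), n))
    return [MASK if i in picked else c for i, c in enumerate(chars)]

def whole_word_mask(words, ratio, rng):
    """Whole-word masking: a selected word has all of its characters masked."""
    total = sum(len(w) for w in words)
    budget = max(1, round(total * ratio))
    order = list(range(len(words)))
    rng.shuffle(order)
    picked, used = set(), 0
    for i in order:
        if used >= budget:
            break
        picked.add(i)
        used += len(words[i])
    out = []
    for i, w in enumerate(words):
        out.extend([MASK] * len(w) if i in picked else list(w))
    return out

# Hypothetical pre-segmented patent phrase (segmentation assumed, not from the paper).
words = ["一种", "锂离子", "电池", "的", "制备", "方法"]
rng = random.Random(0)
print(char_mask(words, 0.15, rng))
print(whole_word_mask(words, 0.15, rng))
```

With single-character masking a word such as 锂离子 can be left partly visible, so the model may recover the masked character from its neighbours inside the same word. Whole-word masking hides the entire word, which forces the prediction to rely on the wider context.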
