
ChpoBERT: A Pre-trained Model for Chinese Policy Texts
Abstract  With the rapid development of deep learning and the rapid accumulation of domain data, domain-specific pre-trained models play an increasingly important supporting role in knowledge organization and mining. For massive Chinese policy texts, building a Chinese policy-text pre-trained model with corresponding pre-training strategies not only helps improve the intelligent processing of Chinese policy texts, but also lays a solid foundation for data-driven, fine-grained, and multi-dimensional analysis and exploration of policy texts. From national-, provincial-, and municipal-level platforms, 131,390 policy texts totaling 305,648,206 Chinese characters were identified through a combination of automatic crawling and manual assistance, after removing non-policy texts. On this corpus, starting from BERT-base-Chinese and Chinese-RoBERTa-wwm-ext, this study built the Chinese policy-text pre-trained models (ChpoBERT) with the masked language model (MLM) and whole word masking (WWM) objectives, and open-sourced the models on GitHub. In terms of perplexity and the downstream tasks of automatic word segmentation, part-of-speech tagging, and named entity recognition on policy texts, the ChpoBERT models show strong performance and can provide domain-specific foundational computing resources for intelligent knowledge mining of policy texts.
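The pre-training recipe summarized above (continued MLM/WWM training of BERT-base-Chinese and Chinese-RoBERTa-wwm-ext on a domain corpus) follows a standard continued-pretraining pattern. The sketch below illustrates that pattern with the Hugging Face transformers library; it is a minimal, hypothetical example, not the authors' released training code: the corpus file name, hyperparameters, and output directory are assumptions, and full Chinese whole word masking additionally needs word-segmentation reference files (as in the Chinese-BERT-wwm recipe), which are omitted here.

    # Minimal sketch: continued MLM pre-training on a domain corpus.
    # Assumptions: the corpus is a plain-text file "policy_corpus.txt"
    # (one document per line); without word-segmentation reference files,
    # whole word masking degrades to character-level masking for Chinese.
    from datasets import load_dataset
    from transformers import (
        BertForMaskedLM,
        BertTokenizerFast,
        DataCollatorForWholeWordMask,
        Trainer,
        TrainingArguments,
    )

    base = "hfl/chinese-roberta-wwm-ext"  # or "bert-base-chinese"
    tokenizer = BertTokenizerFast.from_pretrained(base)
    model = BertForMaskedLM.from_pretrained(base)

    raw = load_dataset("text", data_files={"train": "policy_corpus.txt"})

    def tokenize(batch):
        # Truncate to BERT's 512-token limit; one training example per line.
        return tokenizer(batch["text"], truncation=True, max_length=512)

    train_set = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

    # Randomly masks whole words (15% of tokens by default); the model is
    # trained to reconstruct the masked tokens, i.e., the MLM objective.
    collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="chpobert",  # illustrative output path
            num_train_epochs=1,
            per_device_train_batch_size=16,
        ),
        train_dataset=train_set,
        data_collator=collator,
    )
    trainer.train()

For the perplexity metric mentioned in the abstract, the usual definition is the exponential of the average cross-entropy over held-out tokens, PPL = exp(loss); lower values indicate that the model fits policy-text language more closely.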
Authors  Shen Si; Chen Meng; Feng Shuyang; Xu Qiankun; Liu Jiangfeng; Wang Fei; Wang Dongbo (School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094; College of Information Management, Nanjing Agricultural University, Nanjing 210095; Jiangsu Institute of Science and Technology Information, Nanjing 210042)
Source  Journal of the China Society for Scientific and Technical Information (《情报学报》), CSCD, PKU Core, 2023, No. 12, pp. 1487-1497 (11 pages)
Funding  National Natural Science Foundation of China General Program, "Research on the Construction and Retrieval of Academic Full-text Knowledge Graphs Based on Deep Learning" (71974094).
Keywords  BERT; pre-trained model; policy text; deep learning; perplexity