Abstract
[Purpose/significance] Text resources in thematic databases share closely related topics, high semantic similarity, and high knowledge convergence. To address these characteristics, this paper proposes a text classification model for thematic databases based on pre-trained models and a Blending ensemble learning strategy. [Method/process] The pre-trained models BERT, ERNIE, RoBERTa, ALBERT, and XLNet are used to extract multi-level features from thematic texts, and the models are combined through Blending ensemble learning. Text resources crawled from the "新华丝路" ("Silk Road News") thematic database are used to verify the effectiveness and superiority of the ensemble model. [Result/conclusion] Compared with single models and traditional ensemble learning methods, the Blending-based text classification model achieves higher classification performance in the thematic database service scenario.
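The Blending strategy named in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not in the source: lightweight scikit-learn classifiers stand in for the fine-tuned BERT, ERNIE, RoBERTa, ALBERT, and XLNet models, the data are synthetic, the meta-learner is a logistic regression, and the holdout ratio is 0.3. The helper names `blend_fit` and `blend_predict` are hypothetical. The abstract does not state the paper's actual meta-learner or split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def blend_fit(base_models, meta_model, X_train, y_train, holdout_size=0.3, seed=0):
    """Fit base models on one part of the training data, then fit the
    meta-learner on their class probabilities for a disjoint holdout part."""
    X_fit, X_hold, y_fit, y_hold = train_test_split(
        X_train, y_train, test_size=holdout_size, random_state=seed, stratify=y_train
    )
    for model in base_models:
        model.fit(X_fit, y_fit)
    # Concatenate each base model's class probabilities into meta-features.
    meta_features = np.hstack([m.predict_proba(X_hold) for m in base_models])
    meta_model.fit(meta_features, y_hold)
    return base_models, meta_model


def blend_predict(base_models, meta_model, X):
    """Predict labels by feeding base-model probabilities to the meta-learner."""
    meta_features = np.hstack([m.predict_proba(X) for m in base_models])
    return meta_model.predict(meta_features)


# Synthetic three-class data standing in for encoded thematic texts (assumption).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Stand-in base learners; in the paper's setting these would be the
# fine-tuned pre-trained language models.
base = [
    LogisticRegression(max_iter=1000),
    GaussianNB(),
    DecisionTreeClassifier(max_depth=5, random_state=0),
]
base, meta = blend_fit(base, LogisticRegression(max_iter=1000), X_train, y_train)
pred = blend_predict(base, meta, X_test)
accuracy = float((pred == y_test).mean())
print(f"blended accuracy: {accuracy:.3f}")
```

Unlike stacking with k-fold out-of-fold predictions, Blending trains the meta-learner on a single holdout split. This is cheaper when each base model is expensive to fine-tune, at the cost of using less data for the meta-learner.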
Source
《情报理论与实践》
CSSCI
Peking University Core Journal (北大核心)
2022, Issue 10, pp. 169-175 (7 pages)
Information Studies: Theory & Application
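Funding
An outcome of the National Social Science Fund of China Major Project "Research on the Standardized Management of Humanities and Social Sciences Thematic Database Construction" (人文社科专题数据库建设规范化管理研究), Grant No. 18ZDA326.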
Keywords
thematic text classification
ensemble learning
thematic database
pre-training model