
Question-answering Forestry Pre-trained Language Model: ForestBERT
Abstract 【Objective】To address the low utilization of forestry texts, the insufficient understanding of forestry knowledge by general-domain pre-trained language models, and the time-consuming and labor-intensive nature of manual data annotation, this study draws on a large volume of forestry texts to propose a pre-trained language model that incorporates forestry domain knowledge and, by automatically annotating training data, efficiently realizes forestry extractive question answering, providing intelligent information services for forestry decision-making and management.【Method】First, a forestry corpus covering three topics (terminology, laws and regulations, and literature) was built using web crawler technology. This corpus was used to continue pre-training the general-domain pre-trained language model BERT through self-supervised learning on two tasks, masked language modeling and next sentence prediction, so that BERT could effectively learn forestry semantic information, yielding ForestBERT, a pre-trained language model that captures the general features of forestry text. Next, the pre-trained language model mT5 was fine-tuned to automatically annotate samples; after manual correction, a forestry extractive question-answering dataset of 2280 samples across the three topics was constructed. On this dataset, six general-domain Chinese pre-trained language models (BERT, RoBERTa, MacBERT, PERT, ELECTRA, and LERT) and the ForestBERT model built in this study were trained and validated to identify the advantages of ForestBERT. To examine how the different topics affect model performance, all models were also fine-tuned separately on the forestry terminology, forestry laws and regulations, and forestry literature datasets. Finally, the question-answering results of ForestBERT and BERT on forestry literature were compared visually to show the advantages of ForestBERT more intuitively.【Result】ForestBERT outperformed the other six comparison models overall on the forestry extractive question-answering task. Compared with the base model BERT, ForestBERT improved the exact match (EM) and F1 scores by 1.6% and 1.72%, respectively, and both metrics exceeded the average performance of the other five models by 0.96%. Under each model's optimal data split ratio, ForestBERT outperformed BERT and the other five models by 2.12% and 1.2% in EM, and by 1.88% and 1.26% in F1, respectively. In addition, ForestBERT performed well on all three forestry topics, with evaluation scores 3.06%, 1.73%, and 2.76% higher on average than the other six models on the terminology, laws and regulations, and literature tasks, respectively. Across all models, performance was best on the terminology task, with an average F1 score of 87.63%, while the weakest task, laws and regulations, still reached 82.32%. In the literature extractive question-answering task, ForestBERT provided more accurate and comprehensive answers than BERT.【Conclusion】Enhancing a general-domain pre-trained language model with forestry domain knowledge through continued pre-training effectively improves its performance on the forestry extractive question-answering task, and offers a new approach for processing and applying texts in forestry and other fields.
Authors Tan Jingwei; Zhang Huaiqing; Liu Yang; Yang Jie; Zheng Dongping (Institute of Forest Resource Information Techniques, Chinese Academy of Forestry, Key Laboratory of Forestry Remote Sensing and Information System, National Forestry and Grassland Administration, Beijing 100091; College of Forestry, Beijing Forestry University, Beijing 100083; University of Hawai'i at Mānoa, Honolulu, HI 96822)
Source Scientia Silvae Sinicae (林业科学), 2024, No. 9, pp. 99-110 (12 pages); indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding National Key Research and Development Program of China (2022YFE0128100).
Keywords forestry text; BERT; pre-trained language model; domain-specific pre-training; extractive question answering task; natural language processing
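
The Method section of the abstract describes continued pre-training of BERT on the forestry corpus via masked language modeling and next sentence prediction. The paper's own training code is not included in this record, so the following is a minimal sketch of the masked-language-model part only, assuming the Hugging Face transformers and datasets libraries, the public bert-base-chinese checkpoint, and a hypothetical one-sentence-per-line corpus file forestry_corpus.txt; the hyperparameters are placeholders rather than the values used for ForestBERT, and the next sentence prediction objective would additionally require paired-sentence inputs (e.g., via BertForPreTraining).

# Sketch of continued (domain-adaptive) pre-training on a forestry corpus
# with the masked language model objective. Assumed implementation; not the
# authors' code. Paths and hyperparameters below are hypothetical.
from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# Hypothetical corpus file: one forestry sentence or paragraph per line,
# drawn from the terminology, laws-and-regulations, and literature topics.
corpus = load_dataset("text", data_files={"train": "forestry_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens; the model is trained to recover them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

args = TrainingArguments(
    output_dir="forestbert-mlm",        # placeholder output directory
    num_train_epochs=3,                 # placeholder hyperparameters
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)

Trainer(model=model, args=args,
        train_dataset=tokenized["train"],
        data_collator=collator).train()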
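
The Result section reports exact match (EM) and F1 scores for the extractive question-answering task. For Chinese extractive QA these metrics are commonly computed at the character level over the predicted and reference answer spans; the sketch below illustrates that common formulation and is an assumption, not the paper's exact evaluation script.

# Sketch of exact match (EM) and character-level F1 for a predicted answer
# span, as commonly used for Chinese extractive QA (assumed formulation; the
# paper's normalization rules are not specified in the abstract).
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the predicted span equals the reference span exactly, else 0.0."""
    return float(prediction.strip() == reference.strip())

def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of character-level precision and recall."""
    pred_chars = list(prediction.strip())
    ref_chars = list(reference.strip())
    common = Counter(pred_chars) & Counter(ref_chars)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_chars)
    recall = num_same / len(ref_chars)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: a partially overlapping answer span.
print(exact_match("落叶松毛虫", "松毛虫"))  # 0.0
print(f1_score("落叶松毛虫", "松毛虫"))      # 0.75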