Abstract
Objective: To build a pre-trained language model based on medical texts, in order to address the problem that pre-trained language models built on general corpora adapt poorly to medical text classification. Methods: BioBERT, a pre-trained language model for the medical domain, was obtained by further (secondary) pre-training of the general pre-trained language model BERT on PubMed medical paper abstracts and PMC full-text medical papers; the final medical text classification model was then obtained by fine-tuning BioBERT on labeled text data. Results: Classification experiments on two datasets, medical record texts and medical paper abstracts, showed that the model further pre-trained on medical texts achieved good classification performance on both datasets. Conclusion: A medical-domain pre-trained language model obtained by pre-training on a large volume of medical texts can, to a certain extent, alleviate the poor classification performance caused by the mismatch between general pre-trained language models and the distribution of medical texts.
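The Methods describe two technical steps: continued (secondary) pre-training of BERT on medical corpora, followed by fine-tuning the resulting BioBERT on labeled data for classification. The sketches below illustrate, rather than reproduce, these steps using the Hugging Face Transformers library; the checkpoint names, file names, label count, and hyperparameters are assumptions made for illustration, and the first sketch uses only the masked-language-model objective (the original BioBERT training also included next-sentence prediction).

```python
# Sketch of the secondary (continued) pre-training step: resuming BERT's
# masked-language-model objective on raw medical text such as PubMed abstracts.
# File names are placeholders.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# One abstract/sentence per line in a plain-text file (placeholder path).
corpus = load_dataset("text", data_files={"train": "pubmed_abstracts.txt"})
corpus = corpus.map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-medical-mlm",
                           num_train_epochs=1, per_device_train_batch_size=8),
    train_dataset=corpus["train"],
    data_collator=collator,
)
trainer.train()  # produces a domain-adapted checkpoint analogous to BioBERT
```

The classification model is then obtained by loading a domain-adapted checkpoint (here, a publicly released BioBERT checkpoint assumed to be available on the Hugging Face Hub) with a classification head and fine-tuning it on labeled data:

```python
# Sketch of the fine-tuning step: adapting BioBERT to a labeled medical text
# classification task. Dataset files and the number of labels are hypothetical.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"  # assumed public BioBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical CSV files with "text" and "label" columns.
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})
data = data.map(
    lambda b: tokenizer(b["text"], truncation=True, padding="max_length", max_length=512),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="biobert-clf", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=2e-5),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
)
trainer.train()
print(trainer.evaluate())  # reports eval loss by default; accuracy/F1 need a compute_metrics function
```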
Authors
HUANG Min-ting (黄敏婷); ZHAO Jing (赵静); YU Tao (于涛) (Beijing University of Traditional Chinese Medicine, Beijing 100029, China; Nanyang University of Science and Technology, Singapore 639798)
Source
Chinese Journal of Medical Library and Information Science (《中华医学图书情报杂志》)
CAS
2020, No. 11, pp. 39-46 (8 pages)
Keywords
Medical text
Pre-trained language model
Text classification