Abstract
In recent years, the Materials Genome Initiative (MGI) has become a global research focus. Scarce data sources and non-standardized data storage have left the materials field short of structured data suitable for training machine learning models, which has become a bottleneck for predicting material properties. As materials science develops, the large amount of information contained in materials literature has become the main data source for researchers applying machine learning, and obtaining large amounts of valid materials data is a current challenge. This paper uses natural language processing to extract valid data from the aluminum-silicon alloy literature. Named entity recognition (NER), an important subtask of natural language processing, aims to identify entities with specific meaning in text. Five types of entities are selected from the materials science literature, and an aluminum-silicon alloy entity recognition dataset is built by manual annotation, comprising 5347 sentences and 2835 entities. To reduce the dependence of natural language processing tasks on annotated corpora, transfer learning is used: a language model is pre-trained and then applied to the domain-specific task. Drawing on entity features, the ALBERT (A Lite BERT) pre-trained language model is jointly modeled with conditional random fields (CRF), and the pre-trained model is applied to alloy material entity recognition through active learning. With a small number of labeled training samples, adding active learning raises the model's F1 score, precision, and recall by 0.61%, 2.68%, and 0.29%, respectively. The experiments show that combining pre-training with active learning further reduces both the dependence of entity recognition models on labeled data and the cost of manual annotation. The results can help address the problem of materials data islands, ease the long-standing restriction of materials genome machine learning to small datasets, promote the development of aluminum-silicon alloys, and provide a scientific basis for materials genome-based design of new materials.
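The active learning step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which query strategy the authors used, so this example assumes least-confidence sampling: unlabeled sentences are ranked by how unsure the tagger is, and the most uncertain ones are sent for manual annotation. The names `sentence_uncertainty`, `select_for_annotation`, and the toy `pool` are hypothetical. The per-token probabilities stand in for the marginals that an ALBERT+CRF tagger might produce.

```python
from typing import Dict, List, Sequence

# Assumption: each unlabeled sentence is represented by per-token label
# probability distributions (e.g. marginals from a sequence tagger).
TokenProbs = Sequence[Sequence[float]]


def sentence_uncertainty(token_probs: TokenProbs) -> float:
    """Least-confidence score: 1 minus the mean top-label probability."""
    if not token_probs:
        return 0.0
    top = [max(p) for p in token_probs]
    return 1.0 - sum(top) / len(top)


def select_for_annotation(pool: Dict[str, TokenProbs], budget: int) -> List[str]:
    """Return ids of the `budget` most uncertain sentences in the pool."""
    ranked = sorted(
        pool,
        key=lambda sid: sentence_uncertainty(pool[sid]),
        reverse=True,
    )
    return ranked[:budget]


# Toy pool of three two-token sentences with three candidate labels each
# (illustrative numbers only, not data from the paper).
pool = {
    "s1": [[0.98, 0.01, 0.01], [0.95, 0.03, 0.02]],
    "s2": [[0.40, 0.35, 0.25], [0.50, 0.30, 0.20]],
    "s3": [[0.70, 0.20, 0.10], [0.90, 0.05, 0.05]],
}

selected = select_for_annotation(pool, budget=2)
print(selected)
```

In a full loop, the selected sentences would be labeled by hand, added to the training set, and the model retrained before the next selection round. Other strategies, such as token entropy or margin sampling, fit the same interface.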
Authors
刘英莉
李武亮
牛琛
么长慧
尹建成
沈韬
LIU Yingli; LI Wuliang; NIU Chen; YAO Changhui; YIN Jiancheng; SHEN Tao (Yunnan Key Laboratory of Computer Technology Application, Kunming University of Science and Technology, Kunming 650500, China; Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Faculty of Materials Science and Engineering, Kunming University of Science and Technology, Kunming 650500, China)
Source
《材料科学与工程学报》
CAS
CSCD
Peking University Core Journal
2022, Issue 4, pp. 640-645, 667 (7 pages in total)
Journal of Materials Science and Engineering
Funding
National Natural Science Foundation of China (52061020, 61971208, 51864027)
Open Fund of the Yunnan Key Laboratory of Computer Technology Application (2020103).
Keywords
Material genome
Text recognition
Material named entity recognition
Transfer learning
Pre-trained language model