Detection Method on Data Accuracy Incorporating Materials Domain Knowledge

Cited by: 3

Abstract: Materials data are typically small-sample, high-dimensional, and noisy, so machine learning models built on them often produce results that contradict the understanding of domain experts. Developing machine learning models that embed materials domain knowledge across the whole machine learning workflow is an effective way to address this problem. The accuracy of materials data directly affects the reliability of data-driven prediction of materials properties. Focusing on the data preprocessing stage of machine learning applications, this study proposes a data accuracy detection method that incorporates materials domain knowledge. The method first builds a materials domain knowledge base from the understanding of materials experts. It then combines this knowledge base with data-driven accuracy detection methods to examine a materials dataset from both the data and the domain knowledge perspectives in three stages: single-dimensional data correctness detection based on descriptor value rules, multi-dimensional data correlation detection based on descriptor correlation rules, and full-dimensional data reliability detection based on a multi-dimensional similar-sample identification strategy. Anomalous data identified at each stage are corrected using materials domain knowledge, and domain knowledge is incorporated throughout the detection process so that the dataset is highly accurate from the initial stage. Experiments on a dataset for predicting the activation energy of NASICON-type solid electrolytes show that the method effectively identifies anomalous data and corrects them reasonably. Compared with the original dataset, all six machine learning models trained on the corrected dataset achieve improved prediction accuracy to varying degrees, with R² increasing by 33% on the best model.
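The three detection stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the descriptor names, the value range, the expected correlation sign, the tolerances, and the toy samples are all hypothetical placeholders standing in for a real domain knowledge base, and the unscaled Euclidean distance is a simplification.

```python
from math import sqrt

# Hypothetical knowledge base (illustrative only, not taken from the paper).
VALUE_RULES = {"activation_energy_eV": (0.0, 2.0)}   # allowed range per descriptor
CORRELATION_RULES = [("lattice_a", "volume", +1)]    # expected sign of correlation


def check_values(samples, rules):
    """Stage 1: flag (row, descriptor) pairs that violate a value-range rule."""
    flagged = []
    for i, row in enumerate(samples):
        for name, (lo, hi) in rules.items():
            if name in row and not (lo <= row[name] <= hi):
                flagged.append((i, name))
    return flagged


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def check_correlations(samples, rules):
    """Stage 2: flag descriptor pairs whose observed correlation sign
    contradicts the sign expected from domain knowledge."""
    violated = []
    for a, b, sign in rules:
        r = pearson([s[a] for s in samples], [s[b] for s in samples])
        if r * sign <= 0:
            violated.append((a, b, r))
    return violated


def check_similar_samples(samples, features, target, feat_tol, target_tol):
    """Stage 3: flag pairs of samples that are (nearly) identical in all
    features yet disagree strongly on the target value."""
    conflicts = []
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            dist = sqrt(sum((samples[i][f] - samples[j][f]) ** 2 for f in features))
            gap = abs(samples[i][target] - samples[j][target])
            if dist <= feat_tol and gap > target_tol:
                conflicts.append((i, j))
    return conflicts


# Toy samples (made up for illustration).
data = [
    {"lattice_a": 8.8, "volume": 1500.0, "activation_energy_eV": 0.30},
    {"lattice_a": 8.9, "volume": 1540.0, "activation_energy_eV": 0.32},
    {"lattice_a": 9.0, "volume": 1580.0, "activation_energy_eV": -0.10},
    {"lattice_a": 9.0, "volume": 1580.0, "activation_energy_eV": 0.90},
]

value_flags = check_values(data, VALUE_RULES)
corr_flags = check_correlations(data, CORRELATION_RULES)
pair_flags = check_similar_samples(
    data, ["lattice_a", "volume"], "activation_energy_eV", 1e-6, 0.1
)
print(value_flags, corr_flags, pair_flags)
```

In this toy run, the negative activation energy in the third sample violates the range rule, the lattice parameter and volume correlate positively as expected, and the third and fourth samples are flagged because they share identical descriptors but report very different targets. In the workflow the abstract describes, flagged entries would then be reviewed and corrected using domain knowledge rather than simply dropped.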

Authors: SHI Siqi; SUN Shiyu; MA Shuchang; ZOU Xinxin; QIAN Quan; LIU Yue (Materials Genome Institute, Shanghai University, Shanghai 200444, China; School of Materials Science and Engineering, Shanghai University, Shanghai 200444, China; School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China; Shanghai Engineering Research Center of Intelligent Computing System, Shanghai University, Shanghai 200444, China; Zhejiang Laboratory, Hangzhou 311100, China)

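Affiliations: Materials Genome Institute, Shanghai University; School of Materials Science and Engineering, Shanghai University; School of Computer Engineering and Science, Shanghai University; Shanghai Engineering Research Center of Intelligent Computing System, Shanghai University; Zhejiang Laboratory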

Source: Journal of Inorganic Materials (indexed in SCIE, EI, CAS, CSCD, and the Peking University Core Journals list), 2022, No. 12, pp. 1311-1320, I0001-I0005 (15 pages)

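Funding: National Key Research and Development Program of China (2021YFB3802101); National Natural Science Foundation of China (52073169); Key Research Project of Zhejiang Laboratory (2021PE0AC02).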

Keywords: machine learning; materials science; data quality; domain knowledge

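Classification: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]; O646 [Science: Physical Chemistry]; TB30 [General Industrial Technology: Materials Science and Engineering]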

