Abstract
Digitizing large collections of scientific literature can enable new informatics approaches for scientific analysis and meta-analysis. However, most content in the scientific literature is locked up in written natural language, which is difficult to parse into databases using explicitly hard-coded classification rules. In this work, we demonstrate a semi-supervised machine-learning method to classify inorganic materials synthesis procedures from written natural language. Without any human input, latent Dirichlet allocation can cluster keywords into topics corresponding to specific experimental materials synthesis steps, such as “grinding” and “heating”, “dissolving” and “centrifuging”, etc. Guided by a modest amount of annotation, a random forest classifier can then associate these steps with different categories of materials synthesis, such as solid-state or hydrothermal synthesis. Finally, we show that a Markov chain representation of the order of experimental steps accurately reconstructs a flowchart of possible synthesis procedures. Our machine-learning approach provides a scalable way to unlock the large amount of inorganic materials synthesis information in the literature and to process it into a standardized, machine-readable database.
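As a rough illustration of the final step, a first-order Markov chain over synthesis steps can be estimated by counting how often one step follows another. The sketch below is a minimal example under stated assumptions, not the authors' implementation: the function name `transition_probabilities`, the `START`/`END` markers, and the toy step sequences are all hypothetical and are not data from this work.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate first-order Markov transition probabilities from step sequences."""
    counts = defaultdict(Counter)
    for steps in sequences:
        # Hypothetical START/END markers so that first and last steps are modeled too.
        padded = ["START", *steps, "END"]
        for current, following in zip(padded, padded[1:]):
            counts[current][following] += 1
    # Normalize each row of counts into probabilities that sum to 1.
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


# Toy, made-up step sequences for illustration only (not taken from the paper).
toy_sequences = [
    ["mixing", "grinding", "heating"],
    ["mixing", "grinding", "heating", "grinding", "heating"],
    ["dissolving", "heating", "centrifuging"],
]

probs = transition_probabilities(toy_sequences)
for state, followers in probs.items():
    print(state, followers)
```

Reading the resulting table as a directed graph, with an edge wherever a transition probability is nonzero, gives a flowchart-like view of the possible orderings of steps.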
Funding
Funding to support this work was provided by the Energy & Biosciences Institute through the EBI-Shell program, Office of Naval Research (ONR) Award #N00014-14-1-0444, and the National Science Foundation under Grant No. 5710003959.