Funding: Support from the Ministry of Education (MOE), Singapore, Tier 1 (RG8/20).
Abstract: A large database is desirable for machine learning (ML) to make accurate predictions of materials' physicochemical properties from their molecular structure. When a large database is not available, developing a proper featurization method based on the physicochemical nature of the target properties can improve the predictive power of ML models trained on a smaller database. In this work, we show that two new featurization methods, the volume occupation spatial matrix and the heat contribution spatial matrix, improve the accuracy of predicting energetic materials' crystal density (ρ_crystal) and solid-phase enthalpy of formation (H_f,solid) using a database of 451 energetic molecules. The mean absolute errors are reduced from 0.048 g/cm³ and 24.67 kcal/mol to 0.035 g/cm³ and 9.66 kcal/mol, respectively. Leave-one-out cross-validation shows that the newly developed ML models can be used to determine the performance of most kinds of energetic materials, except cubanes. The ML models were applied to predict ρ_crystal and H_f,solid for CHON-based molecules in the 150-million-entry PubChem database, and 56 candidates with competitive detonation performance and reasonable chemical structures were identified by screening. With further improvement, spatial matrices have the potential to become multifunctional ML simulation tools that could provide even better predictions in wider fields of materials science.
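The evaluation protocol named in this abstract, leave-one-out cross-validation scored by mean absolute error, can be sketched as follows. This is a minimal illustration under stated assumptions: the descriptors are synthetic random numbers, and `fit_ridge` / `predict_ridge` are a stand-in ridge regression, not the spatial-matrix featurization or the models used in the paper.

```python
import numpy as np

def loocv_mae(X, y, fit, predict):
    """Leave-one-out CV: hold out each sample once and train on all the others."""
    n = len(y)
    abs_errors = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i                  # boolean mask that excludes sample i
        model = fit(X[train], y[train])
        abs_errors[i] = abs(predict(model, X[i:i + 1])[0] - y[i])
    return abs_errors.mean(), abs_errors

def fit_ridge(X, y, alpha=1e-3):
    """Stand-in model (assumption): ridge regression with an intercept column."""
    A = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)

def predict_ridge(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

# Synthetic data (assumption): 40 "molecules", 3 descriptors, a density-like target.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + 1.8 + rng.normal(scale=0.01, size=40)

mae, per_sample = loocv_mae(X, y, fit_ridge, predict_ridge)
print(f"LOOCV MAE: {mae:.4f}")
```

Keeping the per-sample errors, rather than only their mean, makes it possible to group errors by chemical family and see which families a model handles poorly, which is the kind of finding the abstract reports for cubanes.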
Abstract: The defining characteristic of the topology of Bayesian networks (BNs) is the interdependence among nodes (variables), which makes it impossible to optimize one variable independently of the others. As a result, learning BN structures with general genetic algorithms is liable to converge to a local extremum. To resolve this problem efficiently, a method based on a self-organizing genetic algorithm (SGA) for constructing BNs from databases is presented. The method uses a self-organizing mechanism to develop a genetic algorithm that extends the number of crossover operators from one to two, provides mutual competition between them, and can even adjust the number of parents in the recombination (crossover/recomposition) schemes. Combined with the K2 algorithm, the method also optimizes the genetic operators and makes adequate use of domain knowledge. With this method it is therefore possible to find a global optimum of the BN topology while avoiding premature convergence to a local extremum. The experimental results proved to be [...], and the convergence of the SGA was discussed.
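The idea of two crossover operators competing with each other can be illustrated with a toy genetic algorithm. The sketch below is not the SGA from the paper: the bit-string encoding, the credit update rule, the mutation rate, and the names `run_ga`, `one_point`, and `uniform` are all assumptions made for illustration, and the fitness is a simple count of ones rather than a BN structure score.

```python
import random

def one_point(a, b, rng):
    """One-point crossover: head of parent a, tail of parent b."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def uniform(a, b, rng):
    """Uniform crossover: each gene comes from either parent with equal chance."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def run_ga(fitness, n_bits=30, pop_size=40, generations=60, seed=0):
    rng = random.Random(seed)
    ops = [one_point, uniform]
    credit = [1.0, 1.0]          # hypothetical running credit; higher means chosen more often
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        children = [pop[0][:]]   # elitism: the best individual always survives
        while len(children) < pop_size:
            a, b = rng.sample(pop[:pop_size // 2], 2)   # parents from the better half
            k = rng.choices(range(len(ops)), weights=credit)[0]
            child = ops[k](a, b, rng)
            # Competition: an operator earns more credit when its child beats both parents.
            improved = fitness(child) > max(fitness(a), fitness(b))
            credit[k] = 0.9 * credit[k] + (1.0 if improved else 0.1)
            if rng.random() < 0.2:                      # mutation: flip one random bit
                j = rng.randrange(n_bits)
                child[j] = 1 - child[j]
            children.append(child)
        pop = children
    total = sum(credit)
    return max(pop, key=fitness), history, [c / total for c in credit]

# Toy objective (assumption): maximize the number of ones in the bit string.
best, history, op_share = run_ga(fitness=sum)
print("best fitness:", sum(best), "operator shares:", op_share)
```

In a structure-learning setting each individual would instead encode a candidate network and the fitness would be a network scoring function; the abstract does not say which encoding or score the authors use, so neither is modeled here.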
Funding: The National Natural Science Foundation of China under Grant No. 62072458.
Abstract: Deep learning has shown significant improvements on various machine learning tasks by introducing a wide spectrum of neural network models. Yet these models require a tremendous amount of labeled training data, which is prohibitively expensive to obtain in practice. In this paper, we propose the OnLine Machine Learning (OLML) database, which stores trained models and reuses them in a new training task to achieve a better training effect with a small amount of training data. An efficient model reuse algorithm, AdaReuse, is developed in the OLML database. Specifically, AdaReuse first estimates the reuse potential of trained models from domain relatedness and model quality, so that a group of trained models with high reuse potential for the training task can be selected efficiently. Then, the multiple selected models are trained iteratively to encourage diversity among them, and a better training effect is achieved by ensembling them. We evaluate AdaReuse on two types of natural language processing (NLP) tasks, and the results show that AdaReuse improves the training effect significantly compared with models trained from scratch when the training data is limited. Based on AdaReuse, we implement an OLML database prototype system that accepts a training task as an SQL-like query and automatically generates a training plan by selecting and reusing trained models. Usability studies illustrate that the OLML database can properly store trained models and reuse them efficiently in new training tasks.
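The select-then-ensemble pattern described in this abstract can be sketched in a few lines. Everything concrete here is an assumption: the abstract does not give AdaReuse's scoring formula, so `reuse_potential` simply multiplies the two factors it names, the `StoredModel` fields and the example models are hypothetical, and the iterative diversity-encouraging training step is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class StoredModel:
    name: str
    relatedness: float   # hypothetical similarity in [0, 1] between source and target domain
    quality: float       # hypothetical validation accuracy in [0, 1] on the original task
    predict: Callable[[str], Sequence[float]]   # returns class probabilities for a text

def reuse_potential(m: StoredModel) -> float:
    """Assumed score: a model must be both related and good to rank highly."""
    return m.relatedness * m.quality

def select_models(store: List[StoredModel], k: int) -> List[StoredModel]:
    """Keep the k stored models with the highest reuse potential."""
    return sorted(store, key=reuse_potential, reverse=True)[:k]

def ensemble_predict(models: List[StoredModel], text: str) -> List[float]:
    """Ensemble by averaging the class probabilities of the selected models."""
    probs = [m.predict(text) for m in models]
    return [sum(col) / len(probs) for col in zip(*probs)]

# Hypothetical model store with constant predictors standing in for trained networks.
store = [
    StoredModel("reviews-sentiment", 0.9, 0.80, lambda t: [0.7, 0.3]),
    StoredModel("news-topic", 0.3, 0.95, lambda t: [0.4, 0.6]),
    StoredModel("tweets-sentiment", 0.8, 0.70, lambda t: [0.9, 0.1]),
]

chosen = select_models(store, k=2)
avg = ensemble_predict(chosen, "great battery life")
print([m.name for m in chosen], avg)
```

The high-quality but unrelated model is skipped in this example because its low relatedness drags down its score, which is the behavior one would expect from any scoring rule that weighs both factors.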