N-11-azaartemisinins potentially active against Plasmodium falciparum are designed by combining molecular electrostatic potential (MEP), ligand-receptor interaction, and models built with supervised machine learning methods (PCA, HCA, KNN, SIMCA, and SDA). The molecular structures were optimized using the B3LYP/6-31G* approach. MEP maps and ligand-receptor interactions were used to investigate, respectively, the key structural features required for biological activity and the likely interactions between N-11-azaartemisinins and heme. The supervised machine learning methods separated the investigated compounds into two classes, cha and cla, with the properties ε<sub>LUMO+1</sub> (energy of the orbital one level above the lowest unoccupied molecular orbital), d(C<sub>6</sub>-C<sub>5</sub>) (distance between the C<sub>6</sub> and C<sub>5</sub> atoms in the ligands), and TSA (total surface area) responsible for the classification. The insights drawn from this investigation, together with chemical intuition, enabled the design of sixteen new N-11-azaartemisinins (the prediction set), to which the models built with supervised machine learning methods were then applied. This application identified twelve new N-11-azaartemisinins as promising candidates for synthesis and biological evaluation.
Point-of-interest (POI) recommendation is a popular topic on location-based social networks (LBSNs). Geographical proximity, a feature unique to LBSNs, significantly affects user check-in behavior. However, most prior studies characterize the geographical influence with a universal or personalized distribution of geographic distance, leading to unsatisfactory recommendation results. In this paper, the personalized geographical influence in a two-dimensional geographical space is modeled using the data field method, and we propose a semi-supervised probabilistic model based on a factor graph model to integrate different factors, including the geographical influence. Moreover, a distributed learning algorithm is used to scale our method up to large data sets. Experimental results on data sets from Foursquare and Gowalla show that our method outperforms competing POI recommendation techniques.
Supervised topic modeling algorithms have been successfully applied to multi-label document classification tasks. Representative models include labeled latent Dirichlet allocation (L-LDA) and dependency-LDA. However, these models neglect the class frequency information of words (i.e., the number of classes in which a word has occurred in the training data), which is significant for classification. To address this, we propose a method, the class frequency weight (CF-weight), which weights words according to their class frequency. The CF-weight is based on the intuition that a word with a higher (lower) class frequency is less (more) discriminative. In this study, the CF-weight is used to improve L-LDA and dependency-LDA. A number of experiments have been conducted on real-world multi-label datasets. Experimental results demonstrate that CF-weight based algorithms are competitive with the existing supervised topic models.
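The class frequency idea above can be illustrated with a minimal sketch. The abstract does not give the CF-weight formula, so the IDF-style weight `log(C / cf(w)) + 1` used here is an assumption made for illustration only, as are the function names, the toy documents, and the choice to count a word in a multi-label document toward every label of that document.

```python
import math
from collections import defaultdict


def class_frequency(docs):
    """Map each word to the number of distinct classes it occurs in.

    docs: iterable of (tokens, labels) pairs from multi-label training data.
    Assumption: a word in a multi-label document counts toward all its labels.
    """
    classes_of = defaultdict(set)
    for tokens, labels in docs:
        for word in set(tokens):
            classes_of[word].update(labels)
    return {word: len(classes) for word, classes in classes_of.items()}


def cf_weight(cf, num_classes):
    """Hypothetical IDF-style weight that shrinks as class frequency grows."""
    return {word: math.log(num_classes / count) + 1.0 for word, count in cf.items()}


# Toy corpus (hypothetical): "team" appears in every class, "match" in one.
docs = [
    (["goal", "match", "team"], {"sports"}),
    (["election", "team", "vote"], {"politics"}),
    (["market", "team", "stock"], {"finance"}),
    (["goal", "vote"], {"sports", "politics"}),
]
num_classes = len({label for _, labels in docs for label in labels})
cf = class_frequency(docs)
weights = cf_weight(cf, num_classes)

for word in ("team", "goal", "match"):
    print(word, cf[word], round(weights[word], 3))
```

A word such as "team", which occurs in all three classes, receives the lowest weight, while "match", which occurs in a single class, receives the highest. How such a weight would enter L-LDA or dependency-LDA is not specified in the abstract and is not modeled here.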
Online social media applications such as Twitter and Weibo have produced a huge volume of short texts. However, mining semantic topics from short texts efficiently is still a challenging problem because of the sparseness of word occurrence and the diversity of topics. To address these problems, we propose a novel supervised pseudo-document-based maximum entropy discrimination latent Dirichlet allocation model (PSLDA for short). Specifically, we first assume that short texts are generated from normal-sized latent pseudo documents and that the topic distributions are sampled from the pseudo documents. In this way, the model reduces the sparseness of word occurrence and the diversity of topics, because it implicitly aggregates short texts into longer, higher-level pseudo documents. To make full use of the label information in the training data, we introduce labels into the model and further propose a supervised topic model to learn a reasonable distribution of topics. Extensive experiments demonstrate that our proposed method achieves better performance than some state-of-the-art methods.
Grain security guarantees national security. China has many widely distributed grain depots to supervise grain storage security. However, this has led to a lack of regulatory capacity and manpower. Amid the development of reserve-level information technology, big data supervision of grain storage security should be improved. This study proposes a big data research architecture and an analysis model for grain storage security; as an example, it illustrates the supervision of the grain loss problem in storage security. A statistical analysis model and a prediction and clustering-based model for grain loss supervision were used to mine abnormal data. A combination of feature extraction and feature selection methods was chosen for dimensionality reduction. A comparative analysis showed that the nonlinear prediction models performed better on the grain loss data set, with R² of 87.21%, 87.83%, 91.97%, and 89.40% for Gradient Boosting Regressor (GBR), Random Forest, Decision Tree, and XGBoost regression on the test sets, respectively. As an example, nineteen abnormal data points were filtered out by GBR combined with residuals. The deep learning model had the best performance on the mean absolute error, with an R² of 85.14% on the test set, but identified only one abnormal data point. This is contrary to the original intention of finding as many anomalies as possible for supervisory purposes. Five classes were generated using principal component analysis dimensionality reduction combined with Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering, with 11 anomalous data points screened by adding the amount of normalized grain loss. Based on the existing grain information system, this paper provides a supervision model for grain storage that can help mine abnormal data. Unlike the current post-event supervision model, this study proposes a pre-event supervision model. This study provides a framework of ideas for subsequent scholarly research; the addition of big data technology will help improve supervisory capacity and efficiency in the field of grain supervision.
This paper poses a question: how many types of social relations can be categorized in the Chinese context? In social networks, calculating tie strength represents the degree of intimacy of the relationship between nodes better than merely indicating whether a link exists. Previous research suggests that Granovetter measures tie strength so as to distinguish strong ties from weak ties, and that the Dunbar circle theory may offer a plausible approach to calculating five types of relations according to interaction frequency via unsupervised learning (e.g., clustering interaction data between users on Facebook and Twitter). In this paper, we differentiate the layers of an ego-centered network by measuring the different dimensions of users' online interaction data based on the Dunbar circle theory. To label the types of Chinese guanxi, we conduct a survey to collect the ground truth from the real world and link this survey data to big data collected from a widely used social network platform in China. After repeating the Dunbar experiments, we modify our computing methods and the indicators computed from big data in order to obtain the model that best fits the ground truth. At the same time, a comprehensive set of effective predictors is selected to engage in a dialogue with existing theories of tie strength. Eventually, by combining Guanxi theory with Dunbar circle studies, four types of guanxi are found to represent a four-layer model of a Chinese ego-centered network.
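The idea of separating an ego network into layers by clustering interaction frequency can be sketched with a one-dimensional k-means. This is an assumption-laden illustration: the abstract does not name the clustering algorithm or the indicators used, so k-means, the evenly spaced initialization, and the contact counts below are all hypothetical choices. Only the number of layers, four, is taken from the abstract.

```python
def kmeans_1d(values, k, max_iter=100):
    """Cluster scalar values into k groups with plain k-means.

    Initial centers are taken at evenly spaced positions of the sorted data,
    which keeps the result deterministic (an illustrative choice).
    """
    data = sorted(values)
    n = len(data)
    centers = [data[int((i + 0.5) * n / k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        new_centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return clusters


# Hypothetical monthly interaction counts between one ego and 12 contacts.
interactions = [1, 2, 2, 3, 10, 11, 12, 30, 32, 35, 90, 95]
layers = kmeans_1d(interactions, k=4)

# Layers are ordered from the weakest ties to the strongest ties.
for index, layer in enumerate(layers, start=1):
    print(index, layer)
```

In practice several interaction dimensions would be combined rather than a single count, and the resulting layers would be checked against survey-based ground truth, as the abstract describes.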
Funding: Supported by the National Key Basic Research Program of China (973 Program) under Grant No. 2014CB340404, the National Natural Science Foundation of China under Grant Nos. 61272111 and 61273216, and the Youth Chenguang Project of Science and Technology of Wuhan City under Grant No. 2014070404010232.
Funding: Project supported by the National Natural Science Foundation of China (No. 61602204).
Funding: Project number 20182001706; with the support of the Tsinghua-Gottingen Student Exchange Project IDS-SSP-2017001.