Funding: Project supported by the National Natural Science Foundation of China (No. 61602204)
Abstract: Supervised topic modeling algorithms have been successfully applied to multi-label document classification tasks. Representative models include labeled latent Dirichlet allocation (L-LDA) and dependency-LDA. However, these models neglect the class frequency information of words (i.e., the number of classes in which a word occurs in the training data), which is significant for classification. To address this, we propose a method, namely the class frequency weight (CF-weight), to weight words by exploiting class frequency knowledge. The CF-weight is based on the intuition that a word with a higher (lower) class frequency is less (more) discriminative. In this study, the CF-weight is used to improve L-LDA and dependency-LDA. A number of experiments have been conducted on real-world multi-label datasets. Experimental results demonstrate that the CF-weight-based algorithms are competitive with existing supervised topic models.
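The abstract describes the CF-weight only at the level of intuition: a word seen in many classes is less discriminative. The exact formula is not given, so the sketch below uses an illustrative IDF-style weight, log(C / cf(w)), where C is the total number of classes and cf(w) is the number of distinct classes in which word w appears; the function name and the formula are assumptions, not the paper's definition.

```python
from collections import defaultdict
from math import log

def class_frequency_weights(docs, labels):
    """Illustrative class-frequency weighting (not the paper's exact formula).

    docs: list of token lists, one per training document.
    labels: list of label sets, one per training document.
    cf(w) = number of distinct classes in which w occurs;
    the weight log(C / cf(w)) shrinks toward 0 as a word spreads
    across more classes, matching the abstract's intuition.
    """
    classes_of_word = defaultdict(set)
    all_classes = set()
    for tokens, labs in zip(docs, labels):
        all_classes.update(labs)
        for w in set(tokens):          # count each word once per document
            classes_of_word[w].update(labs)
    num_classes = len(all_classes)
    return {w: log(num_classes / len(cs))
            for w, cs in classes_of_word.items()}
```

Under this sketch, a word occurring in every class receives weight 0, while a word confined to a single class receives the maximum weight log(C); in L-LDA or dependency-LDA such weights could, for example, scale each word's count during inference.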
Abstract: Various online social media applications, such as Twitter and Weibo, have brought a huge volume of short texts. However, efficiently mining semantic topics from short texts remains a challenging problem because of the sparseness of word co-occurrence and the diversity of topics. To address these problems, we propose a novel supervised pseudo-document-based maximum entropy discrimination latent Dirichlet allocation model (PSLDA for short). Specifically, we first assume that short texts are generated from normal-sized latent pseudo documents, and that topic distributions are sampled from the pseudo documents. In this way, the model reduces the sparseness of word co-occurrence and the diversity of topics because it implicitly aggregates short texts into longer, higher-level pseudo documents. To make full use of the labeled information in the training data, we introduce labels into the model and further propose a supervised topic model to learn a reasonable topic distribution. Extensive experiments demonstrate that our proposed method achieves better performance than several state-of-the-art methods.
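The core idea in this abstract is the aggregation step: each short text is attached to one latent pseudo document, so topic inference operates on longer, denser token pools. The sketch below shows only that aggregation mechanics; in PSLDA the assignment of texts to pseudo documents is inferred jointly with the topics, whereas here a uniform random assignment stands in as a placeholder, and the function name and parameters are hypothetical.

```python
import random
from collections import defaultdict

def aggregate_to_pseudo_docs(short_texts, num_pseudo_docs, seed=0):
    """Illustrative pseudo-document aggregation (placeholder assignment).

    short_texts: list of token lists, one per short text.
    num_pseudo_docs: number of latent pseudo documents to pool into.
    Returns (pseudo, assignment) where pseudo maps a pseudo-document
    index to its pooled token list, and assignment[i] is the pseudo
    document chosen for short text i.
    """
    rng = random.Random(seed)
    pseudo = defaultdict(list)
    assignment = []
    for tokens in short_texts:
        d = rng.randrange(num_pseudo_docs)  # PSLDA infers this; we sample uniformly
        assignment.append(d)
        pseudo[d].extend(tokens)            # pool tokens into the pseudo document
    return pseudo, assignment
```

Once pooled this way, each pseudo document has enough word co-occurrence for a standard topic model to estimate per-pseudo-document topic distributions, which the short texts then inherit through their assignments.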