As a generative model,Latent Dirichlet Allocation Model,which lacks optimization of topics' discrimination capability focuses on how to generate data,This paper aims to improve the discrimination capability throug...As a generative model,Latent Dirichlet Allocation Model,which lacks optimization of topics' discrimination capability focuses on how to generate data,This paper aims to improve the discrimination capability through unsupervised feature selection.Theoretical analysis shows that the discrimination capability of a topic is limited by the discrimination capability of its representative words.The discrimination capability of a word is approximated by the Information Gain of the word for topics,which is used to distinguish between "general word" and "special word" in LDA topics.Therefore,we add a constraint to the LDA objective function to let the "general words" only happen in "general topics" other than "special topics".Then a heuristic algorithm is presented to get the solution.Experiments show that this method can not only improve the information gain of topics,but also make the topics easier to understand by human.展开更多
基金supported by National Nature Science Foundation of China under Grant No.60905017,61072061National High Technical Research and Development Program of China(863 Program)under Grant No.2009AA01A346+1 种基金111 Project of China under Grant No.B08004the Special Project for Innovative Young Researchers of Beijing University of Posts and Telecommunications
文摘As a generative model,Latent Dirichlet Allocation Model,which lacks optimization of topics' discrimination capability focuses on how to generate data,This paper aims to improve the discrimination capability through unsupervised feature selection.Theoretical analysis shows that the discrimination capability of a topic is limited by the discrimination capability of its representative words.The discrimination capability of a word is approximated by the Information Gain of the word for topics,which is used to distinguish between "general word" and "special word" in LDA topics.Therefore,we add a constraint to the LDA objective function to let the "general words" only happen in "general topics" other than "special topics".Then a heuristic algorithm is presented to get the solution.Experiments show that this method can not only improve the information gain of topics,but also make the topics easier to understand by human.