Funding: This work is supported by the National 973 Project (No. 2013CB29606), the Natural Science Foundation of China (No. 61202244), and the research fund of ShangQiu Normal College (No. 2013GGJS013). The NIPS corpus is hosted by SourceForge. We thank the anonymous reviewers for their helpful comments.
Abstract: The problem of "rich topics get richer" (RTGR) is common in topic models and leads to a distorted topic distribution if the allocation process is not intervened in. In the standard LDA (Latent Dirichlet Allocation) model, every word in every document carries the same statistical weight; in fact, words have different impacts on different topics. Guided by this observation, we extend ILDA (Infinite LDA) by taking the biased role of words into account when dividing topics, and we propose a self-adaptive topic model specifically designed to overcome the RTGR problem. The proposed model addresses three issues: (1) the number of topics changes with the document collection, which suits dynamic data; (2) words carry discriminating attributes with respect to the topic distribution; (3) a self-adaptive method realizes automatic re-sampling. To verify the model, we design a topic evolution analysis system that provides the following functions: topic classification in each cycle, topic correlation across adjacent cycles, and strength calculation of the ordered sub-topics. Experiments on both the NIPS corpus and our self-built news collection show that the system meets the stated requirements and that the results are feasible.
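The abstract does not specify how topic correlation across adjacent cycles is computed; one common choice, shown here purely as an illustrative sketch and not the authors' method, is cosine similarity between the topic-word distributions of two consecutive cycles. The matrices `phi_prev` and `phi_curr` and all other names below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical topic-word matrices for two adjacent cycles:
# rows are topics, columns are vocabulary terms (probabilities per topic).
rng = np.random.default_rng(0)
phi_prev = rng.dirichlet(np.ones(1000), size=8)    # 8 topics in the previous cycle
phi_curr = rng.dirichlet(np.ones(1000), size=10)   # 10 topics in the current cycle

def topic_correlation(phi_a, phi_b):
    """Cosine similarity between every topic pair of two adjacent cycles."""
    a = phi_a / np.linalg.norm(phi_a, axis=1, keepdims=True)
    b = phi_b / np.linalg.norm(phi_b, axis=1, keepdims=True)
    return a @ b.T                                  # shape: (K_prev, K_curr)

sim = topic_correlation(phi_prev, phi_curr)
# Link each current-cycle topic to its most similar predecessor.
parents = sim.argmax(axis=0)
print(parents)
```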
Funding: This work is supported by the Science Research Projects of Hunan Provincial Education Department (Nos. 18A174, 18C0262), the National Natural Science Foundation of China (No. 61772561), the Key Research & Development Plan of Hunan Province (Nos. 2018NK2012, 2019SK2022), the Degree & Postgraduate Education Reform Project of Hunan Province (No. 209), and the Postgraduate Education and Teaching Reform Project of Central South Forestry University (No. 2019JG013).
Abstract: Because text topic clustering is slow on a stand-alone architecture under big-data conditions, this paper takes news text as the research object and proposes an LDA text topic clustering algorithm based on the Spark big data platform. The standard TF-IDF (term frequency-inverse document frequency) pipeline under Spark maps words irreversibly, so the mapped word indexes cannot be traced back to the original words. This paper therefore proposes an optimized TF-IDF method under Spark that ensures the original words can be restored. First, text features are extracted with the proposed TF-IDF algorithm combined with CountVectorizer; the features are then fed into the LDA (Latent Dirichlet Allocation) topic model for training, and the text topic clusters are finally obtained. Experimental results show that, for large data samples, the processing speed of LDA topic clustering is improved on Spark. At the same time, compared with an LDA topic model fed with raw word frequencies, the proposed model achieves lower perplexity.
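The abstract does not include code; the sketch below is a minimal PySpark illustration, under the assumption that the described pipeline corresponds to pyspark.ml's CountVectorizer, IDF, and LDA stages. The toy corpus and column names are hypothetical, not the paper's implementation. The point it demonstrates is that CountVectorizer keeps an explicit vocabulary, so term indexes remain traceable back to the original words, unlike a HashingTF-based TF-IDF.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("news-lda-clustering").getOrCreate()

# Toy corpus standing in for the news collection.
docs = spark.createDataFrame(
    [(0, "stock market rises on trade news"),
     (1, "team wins league title after late goal"),
     (2, "central bank holds interest rates steady")],
    ["id", "text"])

tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)

# CountVectorizer keeps an explicit vocabulary, so term indexes can be
# mapped back to words (a HashingTF-based TF-IDF cannot do this).
cv_model = CountVectorizer(inputCol="words", outputCol="tf",
                           vocabSize=10000).fit(tokens)
tf = cv_model.transform(tokens)

# Weight the term counts by inverse document frequency.
tfidf = IDF(inputCol="tf", outputCol="features").fit(tf).transform(tf)

# Train the LDA topic model on the TF-IDF features.
lda_model = LDA(k=3, maxIter=20, featuresCol="features").fit(tfidf)

# Recover human-readable topics via the preserved vocabulary.
vocab = cv_model.vocabulary
for row in lda_model.describeTopics(maxTermsPerTopic=5).collect():
    print([vocab[i] for i in row.termIndices])

# Perplexity, the evaluation metric mentioned in the abstract.
print(lda_model.logPerplexity(tfidf))
```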