Due to the slow processing speed of text topic clustering in stand-alone architecture under the background of big data,this paper takes news text as the research object and proposes LDA text topic clustering algorithm...Due to the slow processing speed of text topic clustering in stand-alone architecture under the background of big data,this paper takes news text as the research object and proposes LDA text topic clustering algorithm based on Spark big data platform.Since the TF-IDF(term frequency-inverse document frequency)algorithm under Spark is irreversible to word mapping,the mapped words indexes cannot be traced back to the original words.In this paper,an optimized method is proposed that TF-IDF under Spark to ensure the text words can be restored.Firstly,the text feature is extracted by the TF-IDF algorithm combined CountVectorizer proposed in this paper,and then the features are inputted to the LDA(Latent Dirichlet Allocation)topic model for training.Finally,the text topic clustering is obtained.Experimental results show that for large data samples,the processing speed of LDA topic model clustering has been improved based Spark.At the same time,compared with the LDA topic model based on word frequency input,the model proposed in this paper has a reduction of perplexity.展开更多
Topic modeling is a mainstream and effective technology to deal with text data, with wide applications in text analysis, natural language, personalized recommendation, computer vision, etc. Among all the known topic m...Topic modeling is a mainstream and effective technology to deal with text data, with wide applications in text analysis, natural language, personalized recommendation, computer vision, etc. Among all the known topic models, supervised Latent Dirichlet Allocation (sLDA) is acknowledged as a popular and competitive supervised topic model. How- ever, the gradual increase of the scale of datasets makes sLDA more and more inefficient and time-consuming, and limits its applications in a very narrow range. To solve it, a parallel online sLDA, named PO-sLDA (Parallel and Online sLDA), is proposed in this study. It uses the stochastic variational inference as the learning method to make the training procedure more rapid and efficient, and a parallel computing mechanism implemented via the MapReduce framework is proposed to promote the capacity of cloud computing and big data processing. The online training capacity supported by PO-sLDA expands the application scope of this approach, making it instrumental for real-life applications with high real-time demand. The validation using two datasets with different sizes shows that the proposed approach has the comparative accuracy as the sLDA and can efficiently accelerate the training procedure. Moreover, its good convergence and online training capacity make it lucrative for the large-scale text data analyzing and processing.展开更多
基金This work is supported by the Science Research Projects of Hunan Provincial Education Department(Nos.18A174,18C0262)the National Natural Science Foundation of China(No.61772561)+2 种基金the Key Research&Development Plan of Hunan Province(Nos.2018NK2012,2019SK2022)the Degree&Postgraduate Education Reform Project of Hunan Province(No.209)the Postgraduate Education and Teaching Reform Project of Central South Forestry University(No.2019JG013).
文摘Due to the slow processing speed of text topic clustering in stand-alone architecture under the background of big data,this paper takes news text as the research object and proposes LDA text topic clustering algorithm based on Spark big data platform.Since the TF-IDF(term frequency-inverse document frequency)algorithm under Spark is irreversible to word mapping,the mapped words indexes cannot be traced back to the original words.In this paper,an optimized method is proposed that TF-IDF under Spark to ensure the text words can be restored.Firstly,the text feature is extracted by the TF-IDF algorithm combined CountVectorizer proposed in this paper,and then the features are inputted to the LDA(Latent Dirichlet Allocation)topic model for training.Finally,the text topic clustering is obtained.Experimental results show that for large data samples,the processing speed of LDA topic model clustering has been improved based Spark.At the same time,compared with the LDA topic model based on word frequency input,the model proposed in this paper has a reduction of perplexity.
基金This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61572226 and 61876069, and the Key Scientific and Technological Research and Development Project of Jilin Province of China under Grant Nos. 20180201067GX and 20180201044GX.
文摘Topic modeling is a mainstream and effective technology to deal with text data, with wide applications in text analysis, natural language, personalized recommendation, computer vision, etc. Among all the known topic models, supervised Latent Dirichlet Allocation (sLDA) is acknowledged as a popular and competitive supervised topic model. How- ever, the gradual increase of the scale of datasets makes sLDA more and more inefficient and time-consuming, and limits its applications in a very narrow range. To solve it, a parallel online sLDA, named PO-sLDA (Parallel and Online sLDA), is proposed in this study. It uses the stochastic variational inference as the learning method to make the training procedure more rapid and efficient, and a parallel computing mechanism implemented via the MapReduce framework is proposed to promote the capacity of cloud computing and big data processing. The online training capacity supported by PO-sLDA expands the application scope of this approach, making it instrumental for real-life applications with high real-time demand. The validation using two datasets with different sizes shows that the proposed approach has the comparative accuracy as the sLDA and can efficiently accelerate the training procedure. Moreover, its good convergence and online training capacity make it lucrative for the large-scale text data analyzing and processing.