Abstract: Most news topic detection methods are word-based; they tend to ignore the relationships among words and suffer from semantic sparsity, resulting in low topic detection accuracy. In addition, the current mainstream probabilistic methods and graph analysis methods for topic detection have high time complexity. For these reasons, we present a news topic detection model based on the capsule semantic graph (CSG). The keywords that co-occur in each text are modeled as a keyword graph, which is divided into multiple subgraphs through community detection. Each subgraph contains a group of closely related keywords and is used as a vertex of the CSG. The semantic relationship among the vertices is obtained by calculating the similarity of the average word vector of each vertex. The news texts are then clustered using an incremental clustering method in which each text is represented by a CSG; that is, the similarity among texts is calculated with a graph kernel. The relationship between vertices and edges is also considered when calculating the similarity. Experimental results on three standard datasets show that CSG obtains higher precision, recall, and F1 values than several recent methods. Experimental results on large-scale news datasets show that the time complexity of CSG is lower than that of probabilistic methods and other graph analysis methods.
Funding: This work is supported by the Science Research Projects of Hunan Provincial Education Department (Nos. 18A174, 18C0262), the National Natural Science Foundation of China (No. 61772561), the Key Research & Development Plan of Hunan Province (Nos. 2018NK2012, 2019SK2022), the Degree & Postgraduate Education Reform Project of Hunan Province (No. 209), and the Postgraduate Education and Teaching Reform Project of Central South Forestry University (No. 2019JG013).
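The keyword-graph step described in the abstract above can be illustrated with a minimal sketch. It assumes keywords have already been extracted per document, links two keywords when they co-occur in at least `min_cooccurrence` documents, and uses connected components as a crude stand-in for community detection, since the abstract does not name the community detection algorithm. The function names, the threshold, and the toy documents are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_keyword_graph(docs, min_cooccurrence=2):
    """Return an adjacency map linking keywords that co-occur in enough documents."""
    pair_counts = defaultdict(int)
    for keywords in docs:
        # Count each unordered keyword pair once per document.
        for a, b in combinations(sorted(set(keywords)), 2):
            pair_counts[(a, b)] += 1

    graph = defaultdict(set)
    for (a, b), count in pair_counts.items():
        if count >= min_cooccurrence:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Split the graph into connected components.

    Assumption: this replaces the (unspecified) community detection step.
    """
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy keyword lists, one per news text (hypothetical data).
docs = [
    ["election", "vote", "senate"],
    ["election", "vote", "ballot"],
    ["earthquake", "rescue", "magnitude"],
    ["earthquake", "rescue", "aftershock"],
]
capsules = connected_components(build_keyword_graph(docs))
print(capsules)
```

Each resulting keyword group would play the role of one CSG vertex; the word-vector similarity between vertices and the graph-kernel comparison between texts are not covered by this sketch.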
Abstract: Because text topic clustering is slow on a stand-alone architecture in a big-data setting, this paper takes news text as the research object and proposes an LDA text topic clustering algorithm based on the Spark big data platform. Because the word mapping used by the TF-IDF (term frequency-inverse document frequency) algorithm under Spark is irreversible, the mapped word indexes cannot be traced back to the original words. This paper proposes an optimized TF-IDF method under Spark that ensures the text words can be restored. First, text features are extracted by the proposed TF-IDF algorithm combined with CountVectorizer, and the features are then input to the LDA (Latent Dirichlet Allocation) topic model for training. Finally, the text topic clusters are obtained. Experimental results show that, for large data samples, running on Spark improves the processing speed of LDA topic model clustering. In addition, compared with an LDA topic model that takes word-frequency input, the proposed model achieves lower perplexity.
Funding: Supported by the National Natural Science Foundation of China (No. 61173027) and the Northeastern University Fundamental Research Funds for the Central Universities (Nos. N150404012 and N140404006).
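The reversibility issue raised in the Spark abstract above can be illustrated in plain Python, without Spark. This is a sketch under stated assumptions, not the paper's implementation: `hashed_index` imitates a hashing-based term mapping (an assumption about what makes the mapping irreversible), while `fit_vocabulary` keeps an explicit vocabulary in the spirit of a count vectorizer, so every feature index can be mapped back to its word. The smoothed IDF formula, the function names, and the toy documents are all hypothetical choices.

```python
import math
import zlib
from collections import Counter


def hashed_index(word, num_features):
    """Hashing-style mapping: many words can share a bucket, so it cannot be inverted."""
    return zlib.crc32(word.encode("utf-8")) % num_features


def fit_vocabulary(docs):
    """Assign each distinct word a stable index and keep the reverse mapping."""
    index_to_word = sorted({word for doc in docs for word in doc})
    word_to_index = {word: i for i, word in enumerate(index_to_word)}
    return word_to_index, index_to_word


def tf_idf(docs, word_to_index):
    """Return one {index: weight} dict per document, using a smoothed IDF."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            word_to_index[w]: tf * math.log((n_docs + 1) / (doc_freq[w] + 1))
            for w, tf in counts.items()
        })
    return vectors


# Toy tokenized news texts (hypothetical data).
docs = [
    ["spark", "lda", "topic"],
    ["spark", "news", "topic"],
    ["news", "cluster"],
]
word_to_index, index_to_word = fit_vocabulary(docs)
vectors = tf_idf(docs, word_to_index)

# With an explicit vocabulary, the highest-weighted index maps back to a word.
top_index = max(vectors[0], key=vectors[0].get)
top_word = index_to_word[top_index]
print(top_word)

# With hashing into 2 buckets, 3 distinct words must collide somewhere.
buckets = {hashed_index(w, 2) for w in ["spark", "lda", "topic"]}
```

Vectors of this form could then be passed to a topic model; because the indexes are recoverable, the top-weighted terms of each learned topic can be reported as words rather than opaque indexes.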
Abstract: Hashtags are important metadata in microblogs and are used to mark topics or index messages. However, statistics show that hashtags are absent from most microblogs. This poses great challenges for the retrieval and analysis of these tagless microblogs. In this paper, we summarize the similarity between microblogs and short-message-style news, and then propose an algorithm, named 5WTAG, for detecting microblog topics based on a model of the five Ws (When, Where, Who, What, hoW). As the five-W attributes are the core components of an event description, it is guaranteed theoretically that 5WTAG can properly extract semantic topics from microblogs. We describe the detailed procedure of the algorithm, including spam microblog identification, microblog segmentation, and candidate hashtag construction. In addition, we propose a novel recommendation computing method for ranking candidate hashtags, which combines syntactic and semantic analysis and takes into account the distribution of human-created topic hashtags. Finally, using real data from Sina Weibo, we conduct comprehensive experiments to verify the semantic correctness and completeness of the candidate hashtags, as well as the accuracy of the recommendation method.