As a generative model, Latent Dirichlet Allocation (LDA) focuses on how data are generated and lacks any optimization of the discrimination capability of its topics. This paper aims to improve that discrimination capability through unsupervised feature selection. Theoretical analysis shows that the discrimination capability of a topic is limited by the discrimination capability of its representative words. The discrimination capability of a word is approximated by the information gain of the word for topics, which is used to distinguish between "general words" and "special words" in LDA topics. We therefore add a constraint to the LDA objective function so that "general words" occur only in "general topics" rather than in "special topics", and we present a heuristic algorithm to solve the constrained problem. Experiments show that this method not only improves the information gain of topics but also makes the topics easier for humans to understand.
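The abstract does not spell out how the information gain of a word for topics is computed; a minimal Python sketch, assuming the common definition IG(w) = H(T) − H(T|w) taken over a fitted topic–word matrix, could look like this (the paper's exact formulation may differ):

```python
import numpy as np

def word_information_gain(topic_word):
    """Approximate IG(w) = H(T) - H(T | w) from a fitted topic-word
    matrix (rows: topics, columns: vocabulary words). Assumes a simple
    occurrence-based definition; the paper's exact formula may differ."""
    joint = topic_word / topic_word.sum()          # treat as joint p(t, w)
    p_w = joint.sum(axis=0)                        # marginal p(w)
    p_t = joint.sum(axis=1)                        # marginal p(t)
    h_t = -np.sum(p_t * np.log2(p_t + 1e-12))      # topic entropy H(T)
    p_t_w = joint / (p_w + 1e-12)                  # p(t | w), columns sum to 1
    h_t_w = -np.sum(p_t_w * np.log2(p_t_w + 1e-12), axis=0)  # H(T | w)
    return h_t - h_t_w                             # one IG value per word
```

Words whose information gain is low are spread evenly across topics and would be the "general word" candidates under this approximation.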
Cloud computing has grown drastically in modern technology by provisioning services to various industries, and data security is a common concern that motivates the intrusion detection system (IDS). An IDS is an essential component for fulfilling security requirements. Recently, diverse Machine Learning (ML) approaches have been used for modeling effective IDS. Most ML-based IDS are categorized as supervised or unsupervised. However, an IDS with supervised learning depends on labeled data; this is a common drawback, and such systems fail to identify unseen attack patterns. Similarly, unsupervised learning alone fails to provide satisfactory outcomes. Therefore, this work concentrates on a semi-supervised learning model, a Fuzzy-based semi-supervised approach through Latent Dirichlet Allocation (F-LDA), for intrusion detection in cloud systems, which helps resolve the aforementioned challenges. LDA provides good generalization ability when training on the labeled data, while a Fuzzy model is adopted to analyze the unlabelled data. Preprocessing is carried out to eliminate data redundancy in the network dataset. To validate the efficiency of F-LDA for intrusion detection, the model is tested on the NSL-KDD Cup dataset, a common traffic dataset. Simulations in a MATLAB environment show that the proposed F-LDA gives better accuracy and more promising outcomes on this benchmark dataset than the prevailing approaches.
Government policy-group integration and policy-chain inference are significant to the execution of strategies in current Chinese society. Specifically, the coordination of hierarchical policies implemented among government departments is one of the key challenges to rural revitalization. In recent years, various well-established quantitative methods have been proposed to evaluate policy coordination, but the majority of these rely on manual analysis, which can lead to subjective results. Thus, in this paper, a novel approach called "policy knowledge graph for the coordination among the government departments" (PG-CODE) is proposed, which incorporates topic modeling into policy knowledge graphs. Similar to a knowledge graph, a policy knowledge graph uses a graph-structured data model to integrate policy discourse. With latent Dirichlet allocation embedding, a policy knowledge graph can capture the underlying topics of the policies. Furthermore, coordination strength and topic diffusion among hierarchical departments can be inferred from the PG-CODE, as it provides a better representation of coordination within the policy space. We implemented and evaluated the PG-CODE in the field of rural innovation and entrepreneurship policy, and the results effectively demonstrate improved coordination among departments.
The Product Sensitive Online Dirichlet Allocation model (PSOLDA) proposed in this paper uses the sentiment polarity of topic words in review text to improve the accuracy of topic evolution. First, we use Latent Dirichlet Allocation (LDA) to obtain the distribution of topic words in the current time window. Second, word2vec word vectors are used as auxiliary information to determine sentiment polarity and obtain the sentiment polarity distribution of the current topics. Finally, the sentiment polarity changes of the topics between the previous and next time windows are mapped to sentiment factors, which control the distribution of topic words in the next time window. The experimental results show that the PSOLDA model decreases the probability distribution by 0.1601, while Online Twitter LDA increases it by 0.0699. The topic evolution method proposed in this paper, which integrates the sentiment information of topic words, outperforms the traditional model.
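As one way to make the second step concrete, here is a hedged sketch of scoring a topic word's polarity by its word2vec similarity to small seed lexicons. The seed words and the scoring rule are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np
from gensim.models import Word2Vec

# Illustrative seed lexicons; the paper's actual sentiment resources are not given.
POS_SEEDS = ["good", "great", "excellent"]
NEG_SEEDS = ["bad", "poor", "terrible"]

def topic_word_polarity(model: Word2Vec, word: str) -> float:
    """Score a topic word as mean similarity to positive seeds minus mean
    similarity to negative seeds; > 0 leans positive, < 0 leans negative."""
    kv = model.wv
    if word not in kv:
        return 0.0
    pos = [kv.similarity(word, s) for s in POS_SEEDS if s in kv]
    neg = [kv.similarity(word, s) for s in NEG_SEEDS if s in kv]
    return float((np.mean(pos) if pos else 0.0) - (np.mean(neg) if neg else 0.0))
```

Aggregating these per-word scores over a topic's top words would give the topic-level sentiment polarity distribution that the evolution step compares across time windows.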
Latent Dirichlet allocation (LDA) is a topic model widely used for discovering hidden semantics in massive text corpora. Collapsed Gibbs sampling (CGS), a widely used algorithm for learning the parameters of LDA, carries a risk of privacy leakage. Specifically, the word count statistics and updates of latent topics in CGS, which are essential for parameter estimation, can be exploited by adversaries to conduct effective membership inference attacks (MIAs). To date, two kinds of methods have been used in CGS to defend against MIAs: adding noise to word count statistics and utilizing inherent privacy. Both have limitations. Noise sampled from the Laplacian distribution sometimes produces negative word count statistics, which results in poor parameter estimation in CGS. Utilizing inherent privacy provides only weak guaranteed privacy against MIAs. It is therefore desirable to have an effective framework that obtains accurate parameter estimates with guaranteed differential privacy. The key issue in introducing differential privacy into CGS while keeping parameter estimation accurate is making good use of the privacy budget so that a precise noise scale is derived. This is the first time Rényi differential privacy (RDP) has been introduced into CGS: we propose RDP-LDA, an effective framework for analyzing the privacy loss of any differentially private CGS. RDP-LDA can derive a tighter upper bound on privacy loss than the overestimated results of existing differentially private CGS obtained by ε-DP. Within RDP-LDA, we propose a novel truncated-Gaussian mechanism that keeps word count statistics non-negative, and we propose distribution perturbation, which provides more rigorous guaranteed privacy than utilizing inherent privacy. Experiments validate that our proposed methods produce more accurate parameter estimates under the JS-divergence metric, and that attacks achieve lower precision and recall when our defenses are applied against MIAs.
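To illustrate the non-negativity idea: a minimal sketch of a truncated-Gaussian noising step that resamples any draw which would push a count below zero. This shows only the sampling side; deriving the noise scale sigma from an RDP budget, which is the substance of RDP-LDA's accounting, is not captured here:

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_gaussian_counts(counts, sigma):
    """Perturb word-count statistics with Gaussian noise whose support is
    truncated to [0, inf) by resampling, so no count goes negative
    (unlike Laplacian noise). A sketch, not the paper's exact mechanism."""
    counts = np.asarray(counts, dtype=float)
    noisy = np.empty_like(counts)
    for i, c in enumerate(counts):
        x = c + rng.normal(0.0, sigma)
        while x < 0.0:                 # reject draws outside [0, inf)
            x = c + rng.normal(0.0, sigma)
        noisy[i] = x
    return noisy

print(truncated_gaussian_counts([0, 3, 10], sigma=1.0))
```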
Previous work on the one-class collaborative filtering (OCCF) problem can be roughly categorized into pointwise methods, pairwise methods, and content-based methods. A fundamental assumption of these approaches is that all missing values in the user-item rating matrix are considered negative. However, this assumption may not hold because the missing values may contain negative and positive examples. For example, a user who fails to give positive feedback about an item may not necessarily dislike it; he may simply be unfamiliar with it. Meanwhile, content-based methods, e.g., collaborative topic regression (CTR), usually require textual content information about the items, and thus their applicability is largely limited when the text information is not available. In this paper, we propose to apply the latent Dirichlet allocation (LDA) model to OCCF to address the above-mentioned problems. The basic idea of this approach is that items are regarded as words, users are considered as documents, and the user-item feedback matrix constitutes the corpus. Our model drops the strong assumption that missing values are all negative and utilizes only the observed data to predict a user's interest. Additionally, the proposed model does not need content information about the items. Experimental results indicate that the proposed method outperforms previous methods on various ranking-oriented evaluation metrics. We further combine this method with a matrix factorization-based method to tackle the multi-class collaborative filtering (MCCF) problem, which also achieves better performance on predicting user ratings.
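The users-as-documents, items-as-words analogy is easy to prototype. A small hedged sketch with gensim on toy implicit feedback; the library, the toy data, and the scoring rule p(item|user) = Σ_t p(t|user) p(item|t) are assumptions for illustration, not the paper's implementation:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy implicit feedback: users are "documents", items are "words".
feedback = {
    "u1": ["i1", "i2", "i3"],
    "u2": ["i2", "i3", "i4"],
    "u3": ["i5", "i6"],
}

dictionary = Dictionary(feedback.values())
corpus = [dictionary.doc2bow(items) for items in feedback.values()]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=50, random_state=0)

def score(user_items, item):
    """Rank a candidate item by p(item | user) = sum_t p(t | user) p(item | t)."""
    theta = lda.get_document_topics(dictionary.doc2bow(user_items),
                                    minimum_probability=0.0)
    phi = lda.get_topics()                     # topics x vocabulary probabilities
    iid = dictionary.token2id[item]
    return sum(p * phi[t][iid] for t, p in theta)

print(score(feedback["u1"], "i4"))             # score an unseen item for u1
```

Note that only observed feedback enters the corpus, which is exactly how the approach avoids treating every missing entry as a negative example.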
The rapid expansion of online content and big data has created an urgent need for efficient summarization techniques that can swiftly digest vast textual documents without compromising their original integrity. Current approaches in Extractive Text Summarization (ETS) leverage the modeling of inter-sentence relationships, a task of paramount importance for producing coherent summaries. This study introduces a model that integrates Graph Attention Networks (GATs) with Transformer-based Bidirectional Encoder Representations from Transformers (BERT) and Latent Dirichlet Allocation (LDA), further enhanced by Term Frequency-Inverse Document Frequency (TF-IDF) values, to improve sentence selection by capturing comprehensive topical information. Our approach constructs a graph with nodes representing sentences, words, and topics, thereby increasing interconnectivity and enabling a more refined understanding of text structure. The model extends from Single-Document Summarization to Multi-Document Summarization (MDS), offering significant improvements over existing models such as THGS-GMM and Topic-GraphSum, as demonstrated by empirical evaluations on benchmark news datasets such as Cable News Network (CNN)/Daily Mail (DM) and Multi-News. The results consistently demonstrate superior performance, showcasing the model's robustness in handling complex summarization tasks across single- and multi-document contexts. This research not only advances the integration of BERT and LDA within a GAT framework but also emphasizes the model's capacity to effectively manage global information and adapt to diverse summarization challenges.
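A hedged sketch of the graph-construction step using networkx and scikit-learn: sentence-word edges weighted by TF-IDF, plus topic-word edges linking each topic to its top words (e.g., from an LDA model). The edge types and weights are illustrative assumptions; the paper's exact construction and the GAT that attends over the graph are not shown:

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer

def build_sentence_word_topic_graph(sentences, topic_top_words):
    """Build an undirected graph with sentence, word, and topic nodes.
    Sentence-word edges carry TF-IDF weights; topic-word edges connect a
    topic to its top words. A sketch of one plausible construction."""
    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(sentences)           # sentences x vocabulary
    vocab = tfidf.get_feature_names_out()
    g = nx.Graph()
    rows, cols = X.nonzero()
    for si, wi in zip(rows, cols):
        g.add_edge(("sent", si), ("word", vocab[wi]), weight=float(X[si, wi]))
    for ti, words in enumerate(topic_top_words): # e.g. top words per LDA topic
        for w in words:
            if ("word", w) in g:
                g.add_edge(("topic", ti), ("word", w), weight=1.0)
    return g
```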
This study undertakes a thorough analysis of sentiment regarding COVID-19 vaccines within the r/Coronavirus subreddit community on Reddit. We collected and processed 34,768 comments, spanning from November 20, 2020, to January 17, 2021, using sentiment calculation methods such as TextBlob and Twitter-RoBERTa-Base-sentiment to categorize comments as positive, negative, or neutral. The methodology used CountVectorizer as the vectorization technique and implemented ensemble algorithms such as XGBoost and Random Forest, achieving an accuracy of approximately 80%. Furthermore, through Latent Dirichlet Allocation, we identified 23 distinct reasons for vaccine distrust among the negative comments. These findings are crucial for understanding the community's attitudes toward vaccination and can guide targeted public health messaging. Our study not only provides insight into public opinion during a critical health crisis but also demonstrates the effectiveness of combining natural language processing tools with ensemble algorithms in sentiment analysis.
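For the TextBlob side of the labeling step, a minimal sketch: TextBlob returns a polarity score in [-1, 1], which can be thresholded into the three classes. The ±0.05 neutral band below is an assumed threshold, not the study's reported one:

```python
from textblob import TextBlob

def label_sentiment(comment: str, eps: float = 0.05) -> str:
    """Map TextBlob polarity in [-1, 1] to positive/negative/neutral."""
    polarity = TextBlob(comment).sentiment.polarity
    if polarity > eps:
        return "positive"
    if polarity < -eps:
        return "negative"
    return "neutral"

print(label_sentiment("This vaccine rollout has been handled really well."))
```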
Due to the slow processing speed of text topic clustering on stand-alone architectures in the era of big data, this paper takes news text as the research object and proposes an LDA text topic clustering algorithm based on the Spark big data platform. Because the TF-IDF (term frequency-inverse document frequency) transformation in Spark applies an irreversible word mapping, the mapped word indexes cannot be traced back to the original words. This paper proposes an optimized TF-IDF method for Spark that ensures the original words can be recovered. First, text features are extracted with the proposed TF-IDF algorithm combined with CountVectorizer; the features are then fed into the LDA (Latent Dirichlet Allocation) topic model for training, and finally the text topic clustering is obtained. Experimental results show that for large data samples, the processing speed of LDA topic-model clustering is improved on Spark. At the same time, compared with an LDA topic model using raw word-frequency input, the proposed model achieves lower perplexity.
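A minimal PySpark sketch of the recoverable pipeline the abstract describes: CountVectorizer (unlike Spark's hashing-based TF) keeps an explicit vocabulary, so topic term indexes map back to words. The toy documents and parameter values are assumptions, and the paper's optimized variant may differ in detail:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-topics").getOrCreate()
docs = spark.createDataFrame(
    [("news about sports results",), ("news about market prices",)], ["text"])

tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)
# CountVectorizer keeps a vocabulary, so indexes can be traced back to words.
cv_model = CountVectorizer(inputCol="words", outputCol="tf").fit(tokens)
tf = cv_model.transform(tokens)
tfidf = IDF(inputCol="tf", outputCol="features").fit(tf).transform(tf)

lda_model = LDA(k=2, maxIter=20, featuresCol="features").fit(tfidf)
for row in lda_model.describeTopics(5).collect():
    print([cv_model.vocabulary[i] for i in row.termIndices])  # words, not indexes
```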
We presented a novel framework for automatic behavior clustering and unsupervised anomaly detection in a large video set. The framework consisted of the following key components: 1) drawing from natural language processing, we introduced a compact and effective behavior representation method that encodes behavior as a stochastic sequence of spatiotemporal events, analyzing the global structural information of behaviors through their local action statistics; 2) the natural grouping of behavior patterns was discovered through a novel clustering algorithm; 3) a run-time accumulative anomaly measure was introduced to detect abnormal behavior, whereas normal behavior patterns were recognized as soon as sufficient visual evidence became available, based on an online Likelihood Ratio Test (LRT) method. This ensured robust and reliable anomaly detection and normal behavior recognition in the shortest possible time. Experimental results demonstrated the effectiveness and robustness of our approach on noisy and sparse data sets collected from a real surveillance scenario.
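A generic sketch of a run-time accumulative likelihood-ratio measure in the SPRT/CUSUM style: accumulate the per-event log-likelihood ratio between an abnormal and a normal behavior model and flag an anomaly once the cumulative evidence crosses a threshold. The models, threshold, and reset rule below are illustrative assumptions; the paper's exact measure is not given in the abstract:

```python
def accumulative_lrt(loglik_normal, loglik_abnormal, threshold=5.0):
    """Scan two aligned per-event log-likelihood streams and return the
    index at which cumulative evidence for 'abnormal' first exceeds the
    threshold, or None if the sequence stays normal."""
    s = 0.0
    for t, (ln, la) in enumerate(zip(loglik_normal, loglik_abnormal)):
        s += la - ln                    # evidence for "abnormal" at step t
        if s >= threshold:
            return t                    # anomaly detected at this event
        s = max(s, 0.0)                 # CUSUM-style reset (an assumption)
    return None
```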
The problem of "rich topics get richer" (RTGR) is common in topic models and yields incorrect topic distributions if the allocation process is not intervened in. In the standard LDA (Latent Dirichlet Allocation) model, every word in every document carries the same statistical weight; in fact, words have different impacts on different topics. Guided by this observation, we extend ILDA (Infinite LDA) by considering the biased role of words in dividing topics, and we propose a self-adaptive topic model that specifically overcomes the RTGR problem. The proposed model addresses three requirements: (1) the topic number changes with the document collection, which suits dynamic data; (2) words have discriminating attributes with respect to the topic distribution; and (3) a self-adaptive method realizes automatic re-sampling. To verify the model, we designed a topic evolution analysis system that realizes the following functions: topic classification within each cycle, topic correlation across adjacent cycles, and strength calculation for the ordered sub-topics. Experiments on both the NIPS corpus and our self-built news collection showed that the system meets these demands and that the results are feasible.
Phishing is the act of attempting to steal a user's financial and personal information, such as credit card numbers and passwords, by pretending to be a trustworthy participant during online communication. Attackers may direct users to a fake website that appears legitimate and then gather useful and confidential information through that site. To protect users from social engineering techniques such as phishing, various measures have been developed, including improvements to technical security. In this paper, we propose a new technique, namely "A Prediction Model for the Detection of Phishing e-mails using Topic Modelling, Named Entity Recognition and Image Processing". The extracted features are topic modelling features, named entity features, and structural features. A multi-classifier prediction model is used to detect phishing e-mails. Experimental results show that the multi-classification technique outperforms single-classifier-based prediction techniques. The resulting accuracy of phishing e-mail detection is 99%, with the highest False Positive Rate being 2.1%.
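To make the multi-classifier idea concrete: a hedged scikit-learn sketch of soft-voting over several base classifiers applied to the extracted feature matrix. The base learners and voting scheme are illustrative assumptions; the abstract does not name them:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# X rows would concatenate the three feature groups the paper names:
# topic-model features, named-entity features, and structural features.
clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("svm", SVC(probability=True)),   # probability=True enables soft voting
    ],
    voting="soft",                        # average class probabilities
)
# Usage: clf.fit(X_train, y_train); y_pred = clf.predict(X_test)
```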
If progress is to be made toward improving geohazard management and emergency decision-making, then lessons need to be learned from past geohazard information. A geologic hazard report provides a useful and reliable source of information about the occurrence of an event, along with detailed information about the conditions and factors of the geohazard. Analyzing such reports, however, can be challenging because these texts are often presented in unstructured long-text formats and contain rich, specialized, and detailed information. Automatic text classification is commonly used to mine disaster text data in open domains (e.g., news and microblogs), but it has limitations in capturing long-distance contextual dependencies and is insensitive to discourse order; these deficiencies are most obviously exposed in long text. Therefore, this paper uses bidirectional encoder representations from Transformers (BERT) to model long text, then utilizes a softmax layer to automatically extract text features and classify geohazards without manual feature engineering. The latent Dirichlet allocation (LDA) model is used to examine the interdependencies between causal variables in order to visualize geohazards. The proposed method enables machine-assisted interpretation of text-based geohazard reports. Moreover, it can help users visualize the causes, processes, and other aspects of geohazards and assist decision-makers in emergency responses.
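A minimal sketch of the BERT-plus-softmax classification step using Hugging Face transformers. The checkpoint "bert-base-chinese", the three-class head, and the sample report are assumptions; the paper fine-tunes on its own labeled geohazard reports, and the head below is untrained until such fine-tuning:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-chinese"                   # assumed checkpoint for Chinese reports
tokenizer = AutoTokenizer.from_pretrained(name)
# num_labels=3 is an assumption; the head is randomly initialized until fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

report = "连日强降雨引发山体滑坡，阻断交通。"    # a (truncated) sample geohazard report
inputs = tokenizer(report, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, num_labels)
probs = torch.softmax(logits, dim=-1)        # the softmax layer over classes
print(int(probs.argmax(dim=-1)))             # predicted geohazard class id
```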
With the progress and development of computer technology, applying machine learning methods to cancer research has become an important research field. To analyze the most recent research status and trends, main research topics, topic evolutions, research collaborations, and potential directions of this research field, this study conducts a bibliometric analysis of 6,206 research articles worldwide, collected from PubMed between 2011 and 2021, concerning cancer research using machine learning methods. Python is used as the tool for bibliometric analysis, Gephi is used for social network analysis, and the Latent Dirichlet Allocation model is used for topic modeling. The trend analysis of articles not only reflects the innovative research at the intersection of machine learning and cancer but also demonstrates its vigorous development and increasing impact. In terms of journals, Nature Communications is the most influential journal and Scientific Reports is the most prolific one. The United States and Harvard University have contributed the most to cancer research using machine learning methods. As for research topics, "Support Vector Machine," "classification," and "deep learning" have been the core focuses of the research field. These findings help scholars and related practitioners to better understand the development status and trends of cancer research using machine learning methods, as well as to gain a deeper understanding of research hotspots.
Induction of common knowledge or regularities from large-scale clinical data is a vital task for Chinese medicine (CM). In this paper, we propose a data mining method, called the Symptom-Herb-Diagnosis topic (SHDT) model, to automatically extract the common relationships among symptoms, herb combinations, and diagnoses from large-scale CM clinical data. The SHDT model is one of the multi-relational extensions of the latent topic model, which can acquire topic structure from discrete corpora (such as document collections) by capturing the semantic relations among words. We applied the SHDT model to discover common CM diagnosis and treatment knowledge for type 2 diabetes mellitus (T2DM) using 3,238 inpatient cases. We obtained meaningful diagnosis and treatment topics (clusters) from the data, which clinically indicated some important medical groups corresponding to comorbid diseases (e.g., heart disease and diabetic kidney disease in T2DM inpatients). The results show that manifestation sub-categories actually exist in T2DM patients that require specific, individualised CM therapies. Furthermore, the results demonstrate that this method is helpful for generating CM clinical guidelines for T2DM based on structured clinical data.