Topic modeling is a probabilistic technique that identifies the topics covered in a text or collection of texts. In this paper, topics were extracted using two implementations of topic modeling, namely Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA). The analysis was performed on a corpus of 1,000 academic papers written in English, obtained from the PLOS ONE website, in the areas of Biology, Medicine, Physics, and Social Sciences. The objective was to verify whether the four academic fields were represented in the four topics obtained by topic modeling. The four topics obtained from LSI and LDA did not represent the four academic fields.
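The paper's check — whether each discovered topic corresponds to one academic field — can be sketched as a topic "purity" computation: assign each document to its dominant topic and measure how field-homogeneous each topic is. This is an illustrative reconstruction, not the authors' code; the labels below are hypothetical.

```python
from collections import Counter

def topic_field_purity(dominant_topics, fields):
    """For each topic, the share of its documents coming from the single
    most common field; purity near 1.0 for every topic would suggest
    that topics align with academic fields."""
    docs_by_topic = {}
    for topic, field in zip(dominant_topics, fields):
        docs_by_topic.setdefault(topic, []).append(field)
    return {t: Counter(fs).most_common(1)[0][1] / len(fs)
            for t, fs in docs_by_topic.items()}

# Hypothetical assignments: every topic mixes two fields, so purity is low,
# mirroring the paper's negative finding.
topics = [0, 0, 1, 1, 2, 2, 3, 3]
fields = ["Biology", "Medicine", "Biology", "Physics",
          "Medicine", "Social", "Physics", "Social"]
print(topic_field_purity(topics, fields))  # every topic at 0.5 here
```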
The sudden arrival of AI (Artificial Intelligence) into people's daily lives around the world was marked by the introduction of ChatGPT, officially released on November 30, 2022. This incursion of AI into our lives drew the attention not only of tech enthusiasts but also of scholars from diverse fields, as its capabilities extend across many domains. Consequently, numerous articles and journals have discussed ChatGPT, making it a headline topic. However, this coverage does not reflect most public opinion about the product. Therefore, this paper investigates the public's opinions on ChatGPT through topic modelling, VADER-based sentiment analysis, and SWOT analysis. To gather data for this study, 202,905 comments were collected from the Reddit platform between December 2022 and December 2023. The findings reveal that the Reddit community engaged in discussions covering a range of topics, including comparisons with traditional search engines; the impacts on software development, the job market, and the education industry; ChatGPT's responses on entertainment and politics; the responses from DAN, the alter ego of ChatGPT; the ethical usage of user data; and queries related to AI-generated images. The sentiment analysis indicates that most people hold positive views of this innovative technology across these aspects. However, concerns also arise regarding its potential negative impacts. The SWOT analysis of these results highlights the strengths, pain points, market opportunities, and threats associated with ChatGPT, and serves as a foundation for the recommendations on product development and policy implementation offered in this paper.
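VADER-style sentiment scoring, available in NLTK as `SentimentIntensityAnalyzer`, boils down to summing lexicon valences and squashing the sum into [-1, 1]. The toy scorer below imitates only that core normalization step; the tiny lexicon is hypothetical, and the real VADER also handles negation, punctuation emphasis, capitalization, and degree modifiers.

```python
# Drastically simplified, lexicon-based scorer in the spirit of VADER.
# The six-word lexicon is purely illustrative.
LEXICON = {"great": 3.1, "good": 1.9, "love": 3.2,
           "bad": -2.5, "terrible": -3.1, "worried": -1.6}

def toy_compound(text):
    scores = [LEXICON.get(w.strip(".,!?").lower(), 0.0)
              for w in text.split()]
    total = sum(scores)
    # VADER-style normalization squashes the raw sum into [-1, 1].
    return total / (total * total + 15) ** 0.5

print(toy_compound("ChatGPT is great, I love it"))  # positive score
print(toy_compound("I am worried this is bad"))     # negative score
```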
The health care system encompasses the participation of individuals, groups, agencies, and resources that offer services to address the health requirements of the person, community, and population. Parallel to the rising debates on healthcare systems in relation to diseases, treatments, interventions, medication, and clinical practice guidelines, the world is currently discussing the healthcare industry, technology perspectives, and healthcare costs. To gain a comprehensive understanding of the healthcare systems research paradigm, we offer a novel contextual topic modeling approach that links the CombinedTM model with our healthcare BERT to discover contextual topics in the healthcare domain. This research discovered 60 contextual topics; among them, fifteen topics are the hottest (including smart medical monitoring systems; causes and effects of stress and anxiety; and healthcare cost estimation) and twelve topics are the coldest. Moreover, thirty-three topics show insignificant trends. We further investigated clusters and correlations among the topics by exploring inter-topic distance maps, which adds depth to the understanding of the research structure of this scientific domain. The current study enhances prior topic modeling methodologies that examine the healthcare literature from a particular disciplinary perspective. It further extends existing topic modeling approaches that do not incorporate contextual information in the topic discovery process, adding contextual information by creating sentence embedding vectors through transformer-based models. We also utilized corpus tuning, the mean pooling technique, and the Hugging Face tools. Our method gives a higher coherence score compared to the state-of-the-art models (LSA, LDA, and BERTopic).
This paper develops a novel online algorithm, moving average stochastic variational inference (MASVI), which uses the results obtained in previous iterations to smooth out noisy natural gradients. We analyze the convergence properties of the proposed algorithm and conduct a set of experiments on two large-scale collections containing millions of documents. Experimental results indicate that, in contrast to the stochastic variational inference (SVI) and SGRLD algorithms, our algorithm achieves a faster convergence rate and better performance.
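The core idea — reusing gradient estimates from previous iterations to damp stochastic noise — can be illustrated with a plain moving average. This is a sketch of the general principle only, not the authors' exact MASVI update rule, and the gradient sequence is made up.

```python
from collections import deque

def moving_average_updates(noisy_grads, window=3):
    """Average the last `window` gradient estimates before each step,
    damping stochastic noise at the cost of a little lag."""
    buf, smoothed = deque(maxlen=window), []
    for g in noisy_grads:
        buf.append(g)
        smoothed.append(sum(buf) / len(buf))
    return smoothed

# A noisy sequence hovering around a true gradient of 1.0: the smoothed
# values stay much closer to 1.0 than the raw estimates do.
grads = [1.4, 0.5, 1.3, 0.7, 1.2, 0.8]
print(moving_average_updates(grads))
```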
This paper deals with the statistical modeling of latent topic hierarchies in text corpora. The height of the topic tree is assumed to be fixed, while the number of topics on each level is unknown a priori and must be inferred from data. Taking a nonparametric Bayesian approach to this problem, we propose a new probabilistic generative model based on the nested hierarchical Dirichlet process (nHDP) and present a Markov chain Monte Carlo sampling algorithm for inferring the topic tree structure as well as the word distribution of each topic and the topic distribution of each document. Our theoretical analysis and experimental results show that this model produces a more compact hierarchical topic structure and captures more fine-grained topic relationships than the hierarchical latent Dirichlet allocation model.
User-Generated Content (UGC) provides a potential data source that can help us better describe and understand how places are conceptualized, and in turn better represent places in Geographic Information Science (GIScience). In this article, we aim to aggregate the shared meanings associated with places and link these to a conceptual model of place. Our focus is on the metadata of Flickr images, in the form of locations and tags. We use topic modeling to identify regions associated with shared meanings. We choose a grid approach and generate topics associated with one or more cells using Latent Dirichlet Allocation. We analyze the sensitivity of our results to both grid resolution and the chosen number of topics using a range of measures, including corpus distance and the coherence value. Using a resolution of 500 m and 40 topics, we are able to generate meaningful topics that characterize places in London based on 954 unique tags associated with around 300,000 images and more than 7,000 individuals.
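The grid approach — binning geotagged photos into roughly 500 m cells and treating each cell's pooled tags as one "document" for LDA — might be sketched as below. The equirectangular metre conversion is a simplification that is reasonable at city scale, and the coordinates and tags are hypothetical.

```python
import math

CELL_M = 500  # grid resolution in metres, as in the article

def cell_of(lat, lon, cell_m=CELL_M):
    """Map a WGS84 point to an approximate square grid cell.
    Simple equirectangular approximation: fine for a city-scale study
    area like London, not for global data."""
    m_per_deg_lat = 111_320.0
    m_per_deg_lon = 111_320.0 * math.cos(math.radians(lat))
    return (int(lat * m_per_deg_lat // cell_m),
            int(lon * m_per_deg_lon // cell_m))

def cell_documents(photos):
    """Aggregate the tags of all photos in a cell into one 'document',
    the unit that would later be fed to LDA."""
    docs = {}
    for lat, lon, tags in photos:
        docs.setdefault(cell_of(lat, lon), []).extend(tags)
    return docs

# Two photos metres apart share a cell; the third lands elsewhere.
photos = [(51.5074, -0.1278, ["bigben", "thames"]),
          (51.5075, -0.1279, ["westminster"]),
          (51.5300, -0.0800, ["shoreditch"])]
print(cell_documents(photos))
```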
User-generated content (UGC) such as blogs and tweets is exploding in modern Internet services. In such systems, recommender systems are needed to help people filter the vast amount of UGC generated by other users. However, traditional recommendation models do not use the user authorship of items. In this paper, we show that with this additional information we can significantly improve recommendation performance. A generative model that combines hierarchical topic modeling and matrix factorization is proposed. Empirical results show that our model outperforms other state-of-the-art models and can provide interpretable topic structures for users and items. Furthermore, since user interests can be inferred from their productions, recommendations can be made for users who do not have any ratings, solving the cold-start problem.
Many existing warning prioritization techniques seek to reorder static analysis warnings such that true positives are presented first. However, an excessive amount of time is still required to investigate and fix the prioritized warnings, because some are not actually true positives or are irrelevant to the code context and topic. In this paper, we propose a warning prioritization technique that reflects various latent topics from bug-related code blocks. Our main aim is to build a prioritization model comprising separate warning priorities depending on the topic of the change sets, in order to identify the number of true positive warnings. For the performance evaluation of the proposed model, we employ a metric called the warning detection rate, widely used in warning prioritization studies, and compare the proposed model with other competitive techniques. Additionally, the effectiveness of our model is verified by applying our technique to eight industrial projects of a real global company.
Environmental, social, and governance (ESG) factors are critical to achieving sustainability in business management and are used as values aimed at enhancing corporate value. Recently, non-financial indicators have come to be considered important for the actual valuation of corporations, so analyzing natural language data related to ESG is essential. Several previous studies limited their focus to specific countries or did not use big data, and past methodologies are insufficient for obtaining potential insights into the best practices for leveraging ESG. To address this problem, the authors used data from two platforms: LexisNexis, which provides media monitoring, and Web of Science, which provides scientific papers. These big data were analyzed by topic modeling, which can derive hidden semantic structures within text; through this process, it is possible to collect information on public and academic sentiment. The authors explored the data from a text-mining perspective using bidirectional encoder representations from transformers topic (BERTopic), a state-of-the-art topic-modeling technique. In addition, changes in subject patterns over time were examined using dynamic topic modeling. As a result, concepts proposed by international organizations such as the United Nations (UN) have been discussed in academia, and the media have formed a variety of agendas.
Emerging topics in app reviews highlight the topics (e.g., software bugs) with which users are concerned during certain periods. Identifying emerging topics accurately, and in a timely manner, could help developers update apps more effectively. Methods for identifying emerging topics in app reviews based on topic models or clustering have been proposed in the literature. However, the accuracy of emerging topic identification is reduced because reviews are short and offer limited information. To solve this problem, an improved emerging topic identification (IETI) approach is proposed in this work. Specifically, we adopt natural language processing techniques to reduce noisy data, and identify emerging topics in app reviews using the adaptive online biterm topic model. We then interpret the implicature of emerging topics through relevant phrases and sentences. We adopt official app changelogs as ground truth and evaluate IETI on six common apps. The experimental results indicate that IETI is more accurate than the baseline in identifying emerging topics, with improvements in the F1 score of 0.126 for phrase labels and 0.061 for sentence labels. Finally, we release the code of IETI on GitHub (https://github.com/wanizhou/IETI).
Automation has recently come to be considered vital in most fields, since computing methods play a significant role in facilitating work such as automatic text summarization. Most of the computing methods used in real systems are based on graph models, which are characterized by their simplicity and stability. This paper therefore proposes an improved extractive text summarization algorithm based on both topic and graph models. The methodology consists of two stages. First, the well-known TextRank algorithm is analyzed and its shortcomings are investigated. Then, an improved method is proposed with a new computational model of sentence weights. Experiments were carried out on the standard DUC2004 and DUC2006 datasets, where the proposed improved graph model algorithm, TG-SMR (Topic Graph-Summarizer), is compared to four other text summarization systems. The experimental results show that the proposed TG-SMR algorithm achieves higher ROUGE scores. It is foreseen that the TG-SMR algorithm will open a new horizon concerning the performance of ROUGE evaluation indicators.
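The TextRank stage the paper builds on scores sentences by power iteration over a sentence-similarity graph. A minimal version is sketched below; it assumes a precomputed symmetric similarity matrix, whereas the full algorithm also derives those similarities from word overlap, and the matrix here is a toy example.

```python
def textrank(sim, d=0.85, iters=50):
    """Plain TextRank over a sentence-similarity matrix `sim`
    (sim[i][j] >= 0, zero diagonal); returns one score per sentence."""
    n = len(sim)
    # Column-normalize so each sentence distributes its score to neighbours.
    col = [sum(sim[i][j] for i in range(n)) or 1.0 for j in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - d) / n + d * sum(sim[i][j] * scores[j] / col[j]
                                        for j in range(n))
                  for i in range(n)]
    return scores

# Toy matrix for three sentences: sentence 0 overlaps with both others,
# so it should rank highest and be picked first for the summary.
sim = [[0.0, 0.6, 0.4],
       [0.6, 0.0, 0.1],
       [0.4, 0.1, 0.0]]
scores = textrank(sim)
print(max(range(3), key=scores.__getitem__))  # sentence 0 wins
```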
Retelling (paraphrase) extraction is an important branch of Natural Language Processing (NLP), and high-quality retelling resources are very helpful for improving the performance of machine translation. However, traditional methods based on bilingual parallel corpora often ignore the document background during retelling acquisition and application. To solve this problem, we introduce topic model information into the translation model and propose a topic-based statistical machine translation method to improve translation performance. In this method, Probabilistic Latent Semantic Analysis (PLSA) is used to obtain the co-occurrence relationship between words and documents by hybrid matrix decomposition. We then design a decoder to simplify the decoding process. Experiments show that the proposed method can effectively improve translation accuracy.
Globally, educational institutions reported a dramatic shift to online learning in an effort to contain the COVID-19 pandemic. The fundamental concern was the continuance of education, and several novel solutions were developed to address technical and pedagogical issues. However, these were not the only difficulties that students faced: the implemented solutions kept the educational process operating with little regard for students' changed circumstances, which obliged them to study from home. Students should be asked to provide a full list of their concerns. Accordingly, student reflections, including those from Saudi Arabia, have been analysed to identify obstacles encountered during the COVID-19 pandemic. However, most of these analyses relied on closed-ended questions, which limited student involvement. To delve into students' responses, this study used open-ended questions, a qualitative method (content analysis), a quantitative method (topic modelling), and sentiment analysis. This study also examined students' emotional states during and after the COVID-19 pandemic. In terms of determining trends in students' input, the results showed that the quantitative and qualitative methods produced similar outcomes. Students had unfavourable sentiments about studying during COVID-19 and positive sentiments about face-to-face study. Furthermore, topic modelling revealed that the majority of difficulties related to the environment (home) and social life; students were less accepting of online learning. It is therefore possible to conclude that face-to-face study still attracts students and provides benefits that online study cannot, such as social interaction and effective eye-to-eye communication.
This paper aims to develop machine learning algorithms to classify electronic articles related to the COVID-19 pandemic through information retrieval and topic modelling. The methodology is categorized into three phases: the Text Classification Approach (TCA), the Proposed Algorithms Interpretation (PAI), and finally the Information Retrieval Approach (IRA). The TCA covers the text preprocessing pipeline that produces a clean corpus. The Global Vectors for Word Representation (GloVe) pre-trained model, FastText, Term Frequency-Inverse Document Frequency (TF-IDF), and Bag-of-Words (BOW) feature extraction methods are examined in this research. The PAI applies Bidirectional Long Short-Term Memory (Bi-LSTM) and Convolutional Neural Network (CNN) models to classify the COVID-19 news. The IRA then explains the mathematical interpretation of Latent Dirichlet Allocation (LDA), used for topic modelling in Information Retrieval (IR). In this study, 99% accuracy was obtained by performing K-fold cross-validation on Bi-LSTM with GloVe. A comparative analysis between deep learning and machine learning, based on feature extraction and computational complexity, was also performed, and the most influential aspects of each document were explored. We utilized Bidirectional Encoder Representations from Transformers (BERT) as a deep learning mechanism in our model training, but the results were not satisfactory. The proposed system is, however, adaptable to real-time classification of COVID-19 news.
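TF-IDF, one of the feature extractors compared above, can be computed in a few lines. The smoothing variant below (idf = log(N/df) + 1) is one common choice and may differ from the exact formula used in the paper; the toy documents are illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.
    tf = count / len(doc); idf = log(N / df) + 1 (a common smoothing)."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    out = []
    for d in docs:
        tf = Counter(d)
        out.append({w: (c / len(d)) * (math.log(n / df[w]) + 1)
                    for w, c in tf.items()})
    return out

docs = [["covid", "vaccine", "news"],
        ["covid", "cases", "rise"],
        ["football", "news"]]
weights = tfidf(docs)
# "covid" appears in 2 of 3 docs, so it is down-weighted relative to a
# word unique to one document, like "vaccine".
print(weights[0]["vaccine"] > weights[0]["covid"])  # True
```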
Purpose: This paper reports on a scientometric analysis, bolstered by human-in-the-loop domain experts, examining the field of metal-organic frameworks (MOFs) research. Scientometric analyses reveal the intellectual landscape of a field. The study engaged MOF scientists in the design and review of our research workflow. MOF materials are an essential component in next-generation renewable energy storage and biomedical technologies. The research approach demonstrates how engaging experts, via human-in-the-loop processes, can help develop a comprehensive view of a field's research trends, influential works, and specialized topics.
Design/methodology/approach: A scientometric analysis was conducted, integrating natural language processing (NLP), topic modeling, and network analysis methods. The analytical approach was enhanced through a human-in-the-loop iterative process involving MOF research scientists at selected intervals, and their feedback was incorporated into our method. The data sample included 65,209 MOF research articles. Python 3 and the software tool VOSviewer were used to perform the analysis.
Findings: The findings demonstrate the value of including domain experts in research workflows, refinement, and the interpretation of results. At each stage of the analysis, the MOF researchers contributed to interpreting the results and refining the method to keep it focused on MOF research. This study identified influential works and their themes. Our findings also underscore four main MOF research directions and applications.
Research limitations: This study is limited by the sample (articles identified and referenced by the Cambridge Structural Database) that informed our analysis.
Practical implications: Our findings contribute to addressing the current gap in fully mapping out the comprehensive landscape of MOF research. Additionally, the results will help domain scientists target future research directions.
Originality/value: To the best of our knowledge, the number of publications collected for analysis exceeds that of previous studies, enabling us to explore a more extensive body of MOF research than prior work. Another contribution is the iterative engagement of domain scientists, who brought in-depth expert interpretation to the data analysis, helping hone the study.
Aiming to identify the policy topics, and their evolutionary logic, that enhance the digital and green development ("dual development") of traditional manufacturing enterprises, to address weaknesses in current policies, and to provide resources for refining dual development policies, a total of 15,954 dual development-related policies issued by national and departmental authorities in China from January 2000 to August 2023 were analyzed. Based on topic modeling techniques and the policy modeling consistency (PMC) framework, the evolution of policy topics was visualized and a dynamic assessment of the policies was conducted. The results show that the digital and green development policy framework has been progressively refined, and the governance philosophy has shifted from a "regulatory government" paradigm to a "service-oriented government". The support pattern has evolved from "dispersed matching" to "integrated symbiosis". However, there are still significant deficiencies in departmental cooperation, balanced measures, coordinated links, and multi-stakeholder participation. Future policy improvements should therefore focus on guiding multi-stakeholder participation, enhancing public demand orientation, and addressing the entire value chain. These steps aim to create an open and shared digital industry ecosystem that promotes the coordinated dual development of traditional manufacturing enterprises.
Social media has revolutionized the dissemination of real-life information, serving as a robust platform for sharing life events. Twitter, characterized by its brevity and continuous flow of posts, has emerged as a crucial source for public health surveillance, offering valuable insights into public reactions during the COVID-19 pandemic. This study leverages a range of machine learning techniques to extract pivotal themes and facilitate text classification on a dataset of COVID-19 outbreak-related tweets. Diverse topic modeling approaches were employed to extract pertinent themes and subsequently form a dataset for training text classification models. An assessment of coherence metrics revealed that the Gibbs Sampling Dirichlet Mixture Model (GSDMM), which utilizes trigram and bag-of-words (BOW) feature extraction, outperformed Non-negative Matrix Factorization (NMF), Latent Dirichlet Allocation (LDA), and a hybrid strategy involving Bidirectional Encoder Representations from Transformers (BERT) combined with LDA and K-means in pinpointing significant themes within the dataset. Among the models assessed for text clustering, the use of LDA, either as a clustering model or for feature extraction combined with BERT for K-means, resulted in higher coherence scores, consistent with human ratings, signifying their efficacy. In particular, LDA, notably in conjunction with trigram representation and BOW, demonstrated superior performance, underscoring its suitability for topic modeling given its proficiency in capturing intricate textual relationships. In the context of text classification, models such as Linear Support Vector Classification (LSVC), Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), Convolutional Neural Network with BiLSTM (CNN-BiLSTM), and BERT showed outstanding performance, achieving accuracy and weighted F1 scores exceeding 80%. These results significantly surpassed other models, such as Multinomial Naive Bayes (MNB), Linear Support Vector Machine (LSVM), and Logistic Regression (LR), which achieved scores in the range of 60 to 70 percent.
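Coherence metrics like those used to compare GSDMM, NMF, and LDA score how often a topic's top words actually co-occur in documents. Below is a minimal UMass-style coherence — one of several coherence variants, and not necessarily the exact metric the study used; the documents are toy data.

```python
import math

def umass_coherence(top_words, docs, eps=1.0):
    """UMass topic coherence: sum over ordered word pairs of
    log((D(wi, wj) + eps) / D(wj)), where D counts documents containing
    the word(s). Higher (closer to 0) means top words co-occur more."""
    doc_sets = [set(d) for d in docs]
    def d_count(*ws):
        return sum(all(w in ds for w in ws) for ds in doc_sets)
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += math.log(
                (d_count(top_words[i], top_words[j]) + eps)
                / d_count(top_words[j]))
    return score

docs = [["virus", "mask", "vaccine"],
        ["virus", "mask"],
        ["football", "goal"]]
# Words that co-occur score higher than words that never do.
coherent = umass_coherence(["virus", "mask"], docs)
incoherent = umass_coherence(["virus", "goal"], docs)
print(coherent > incoherent)  # True
```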
Forest habitats are critical for biodiversity, ecosystem services, human livelihoods, and well-being. The capacity to conduct theoretical and applied forest ecology research addressing direct (e.g., deforestation) and indirect (e.g., climate change) anthropogenic pressures has benefited considerably from new field and statistical techniques. We used machine learning and bibliometric structural topic modelling to identify 20 latent topics, comprising four principal fields, from a corpus of 16,952 forest ecology/forestry articles published in eight ecology and five forestry journals between 2010 and 2022. Articles published per year increased from 820 in 2010 to 2,354 in 2021, shifting toward more applied topics. Publications from China and some countries in North America and Europe dominated, with relatively fewer articles from some countries in West and Central Africa and West Asia, despite their globally important forest resources. Most study sites were in countries in North America, Central Asia, and South America, and in Australia. Articles utilizing the R statistical software predominated, increasing from 29.5% in 2010 to 71.4% in 2022. The most frequently used packages included lme4, vegan, nlme, MuMIn, ggplot2, car, MASS, mgcv, multcomp, and raster. R was used more often in forest ecology than in applied forestry articles; it offers advantages in script and workflow sharing compared to other statistical packages. Our findings demonstrate that the disciplines of forest ecology/forestry are expanding in both number and scope, aided by more sophisticated statistical tools, to tackle the challenges of redressing forest habitat loss and the socio-economic impacts of deforestation.
Most research on anomaly detection has focused on events that differ from their spatial-temporal neighboring events. It remains a significant challenge to detect anomalies that involve multiple normal events interacting in an unusual pattern. In this work, a novel unsupervised method based on a sparse topic model is proposed to capture motion patterns and detect anomalies in traffic surveillance. Scale-invariant feature transform (SIFT) flow is used to improve the dense trajectory, in order to extract interest points and their corresponding descriptors with less interference. To strengthen the relationship of interest points on the same trajectory, the Fisher kernel method is applied to obtain a trajectory representation, which is quantized into visual words. The sparse topic model then explores the latent motion patterns and achieves a sparse representation of the video scene. Finally, two anomaly detection algorithms are compared, based on video clip detection and visual word analysis respectively. Experiments were conducted on the QMUL Junction dataset and the AVSS dataset, and the results demonstrate the superior efficiency of the proposed method.
Topic models such as Latent Dirichlet Allocation (LDA) have been successfully applied to many text mining tasks for extracting topics embedded in corpora. However, existing topic models generally cannot discover bursty topics that experience a sudden increase during a period of time. In this paper, we propose a new topic model named Burst-LDA, which simultaneously discovers topics and reveals their burstiness by explicitly modeling each topic's burst states with a first-order Markov chain and using the chain to generate the topic proportions of documents in a logistic normal fashion. A Gibbs sampling algorithm is developed for the posterior inference of the proposed model. Experimental results on a news dataset show that our model can efficiently discover bursty topics, outperforming the state-of-the-art method.
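Burst-LDA's posterior inference relies on Gibbs sampling; the collapsed Gibbs sampler for vanilla LDA that such models extend can be written compactly. The sketch below omits the burst-state Markov chain and logistic-normal component that distinguish Burst-LDA, and the two-topic corpus is a toy.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for vanilla LDA; returns the
    per-document topic counts after `iters` sweeps."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    z = [[rng.randrange(K) for _ in d] for d in docs]
    ndk = [[0] * K for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
    nk = [0] * K                                # topic totals
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            k = z[di][wi]
            ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                k = z[di][wi]  # remove token, resample its topic
                ndk[di][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[di][t] + alpha) *
                           (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(K)]
                k = rng.choices(range(K), weights)[0]
                z[di][wi] = k
                ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk

# Two clearly separated vocabularies: each document's tokens should tend
# to concentrate in a single topic.
docs = [["apple", "banana", "apple"], ["banana", "apple"],
        ["goal", "score", "goal"], ["score", "goal"]]
print(lda_gibbs(docs, K=2))
```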
文摘Topic modeling is a probabilistic model that identifies topics covered in text(s). In this paper, topics were loaded from two implementations of topic modeling, namely, Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA). This analysis was performed in a corpus of 1000 academic papers written in English, obtained from PLOS ONE website, in the areas of Biology, Medicine, Physics and Social Sciences. The objective is to verify if the four academic fields were represented in the four topics obtained by topic modeling. The four topics obtained from Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) did not represent the four academic fields.
文摘The sudden arrival of AI(Artificial Intelligence) into people's daily lives all around the world was marked by the introduction of ChatGPT, which was officially released on November 30, 2022. This AI invasion in our lives drew the attention of not only tech enthusiasts but also scholars from diverse fields, as its capacity extends across various fields. Consequently, numerous articles and journals have been discussing ChatGPT, making it a headline for several topics. However, it does not reflect most public opinion about the product. Therefore, this paper investigated the public's opinions on ChatGPT through topic modelling, Vader-based sentiment analysis and SWOT analysis. To gather data for this study, 202905 comments from the Reddit platform were collected between December 2022 and December 2023. The findings reveal that the Reddit community engaged in discussions related to ChatGPT, covering a range of topics including comparisons with traditional search engines, the impacts on software development, job market, and education industry, exploring ChatGPT's responses on entertainment and politics, the responses from Dan, the alter ego of ChatGPT, the ethical usage of user data as well as queries related to the AI-generated images. The sentiment analysis indicates that most people hold positive views towards this innovative technology across these several aspects. However, concerns also arise regarding the potential negative impacts associated with this product. The SWOT analysis of these results highlights both the strengths and pain points, market opportunities and threats associated with ChatGPT. This analysis also serves as a foundation for providing recommendations aimed at the product development and policy implementation in this paper.
Abstract: The health care system encompasses the participation of individuals, groups, agencies, and resources that offer services to address the health requirements of the person, community, and population. Parallel to the rising debates on healthcare systems in relation to diseases, treatments, interventions, medication, and clinical practice guidelines, the world is currently discussing the healthcare industry, technology perspectives, and healthcare costs. To gain a comprehensive understanding of the healthcare systems research paradigm, we offer a novel contextual topic modeling approach that links the CombinedTM model with our healthcare BERT to discover contextual topics in the healthcare domain. This work discovered 60 contextual topics, of which fifteen are the hottest (including smart medical monitoring systems, causes and effects of stress and anxiety, and healthcare cost estimation) and twelve are the coldest; the remaining thirty-three topics show insignificant trends. We further investigated clusters and correlations among the topics by exploring inter-topic distance maps, which adds depth to the understanding of the research structure of this scientific domain. The current study enhances prior topic modeling methodologies that examine the healthcare literature from a particular disciplinary perspective. It further extends existing topic modeling approaches that do not incorporate contextual information in the topic discovery process, adding contextual information by creating sentence embedding vectors through transformer-based models. We also utilized corpus tuning, the mean pooling technique, and the Hugging Face tools. Our method yields a higher coherence score than state-of-the-art models (LSA, LDA, and BERTopic).
Funding: Project supported by the National Natural Science Foundation of China (Nos. 61170092, 61133011, and 61103091)
Abstract: This paper develops a novel online algorithm, moving average stochastic variational inference (MASVI), which applies the results obtained in previous iterations to smooth out noisy natural gradients. We analyze the convergence properties of the proposed algorithm and conduct a set of experiments on two large-scale collections that contain millions of documents. Experimental results indicate that, in contrast to the 'stochastic variational inference' and 'SGRLD' algorithms, our algorithm achieves a faster convergence rate and better performance.
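The core smoothing idea — averaging noisy gradient estimates across iterations — can be sketched with a simple exponential moving average. Note this is only the generic averaging principle; MASVI's actual update operates on natural gradients inside the stochastic variational inference loop, and the decay constant below is an arbitrary illustration.

```python
def smooth_gradients(grads, beta=0.9):
    """Exponential moving average over a stream of scalar gradient
    estimates: each output is beta * previous_average + (1 - beta) * g.
    High-variance noise in the raw stream is damped in the output."""
    avg, out = 0.0, []
    for g in grads:
        avg = beta * avg + (1.0 - beta) * g
        out.append(avg)
    return out
```

For a raw stream oscillating between +1 and -1, the smoothed values stay within about ±0.1, which is why averaged updates can take larger, more stable steps.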
Funding: Project (No. 60773180) supported by the National Natural Science Foundation of China
Abstract: This paper deals with the statistical modeling of latent topic hierarchies in text corpora. The height of the topic tree is assumed to be fixed, while the number of topics on each level is unknown a priori and is to be inferred from data. Taking a nonparametric Bayesian approach to this problem, we propose a new probabilistic generative model based on the nested hierarchical Dirichlet process (nHDP) and present a Markov chain Monte Carlo sampling algorithm for the inference of the topic tree structure as well as the word distribution of each topic and the topic distribution of each document. Our theoretical analysis and experimental results show that this model can produce a more compact hierarchical topic structure and capture more fine-grained topic relationships than the hierarchical latent Dirichlet allocation model.
Funding: funded by the Swiss National Science Foundation Project PlaceGen [grant number 200021_149823].
Abstract: User-Generated Content (UGC) provides a potential data source which can help us better describe and understand how places are conceptualized, and in turn better represent places in Geographic Information Science (GIScience). In this article, we aim at aggregating the shared meanings associated with places and linking these to a conceptual model of place. Our focus is on the metadata of Flickr images, in the form of locations and tags. We use topic modeling to identify regions associated with shared meanings. We choose a grid approach and generate topics associated with one or more cells using Latent Dirichlet Allocation. We analyze the sensitivity of our results to both grid resolution and the chosen number of topics using a range of measures, including corpus distance and the coherence value. Using a resolution of 500 m and 40 topics, we are able to generate meaningful topics which characterize places in London based on 954 unique tags associated with around 300,000 images and more than 7000 individuals.
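The "coherence value" used to choose the number of topics can be computed in several ways; the abstract does not say which variant was used. One common choice, UMass coherence, needs only document co-occurrence counts, and a minimal sketch is:

```python
import math
from itertools import combinations

def umass_coherence(topic_words, docs):
    """UMass topic coherence: sum over ordered word pairs of
    log((D(wi, wj) + 1) / D(wj)), where D counts the documents
    containing the given word(s). Scores closer to 0 indicate that
    the topic's top words genuinely co-occur."""
    docsets = [set(d) for d in docs]

    def D(*ws):
        return sum(1 for s in docsets if all(w in s for w in ws))

    score = 0.0
    for wi, wj in combinations(topic_words, 2):
        denom = D(wj)
        if denom:
            score += math.log((D(wi, wj) + 1) / denom)
    return score
```

On a toy corpus of tag sets, a pair of tags that appear together scores higher than a pair that never co-occurs, which is the signal used when sweeping grid resolutions and topic counts.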
Funding: Project supported by the Monitoring Statistics Project on Agricultural and Rural Resources, MOA, China, the Innovative Talents Project, MOA, China, and the Science and Technology Innovation Project Fund of the Chinese Academy of Agricultural Sciences (No. CAAS-ASTIP-2015-AI I-02)
Abstract: User-generated content (UGC) such as blogs and tweets is exploding in modern Internet services. In such systems, recommender systems are needed to help people filter the vast amount of UGC generated by other users. However, traditional recommendation models do not use the user authorship of items. In this paper, we show that with this additional information we can significantly improve recommendation performance. A generative model that combines hierarchical topic modeling and matrix factorization is proposed. Empirical results show that our model outperforms other state-of-the-art models and can provide interpretable topic structures for users and items. Furthermore, since user interests can be inferred from their productions, recommendations can be made for users who do not have any ratings, solving the cold-start problem.
Funding: The research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT & Future Planning under Grant No. NRF-2019R1A2C2084158, and by Samsung Electronics Co., Ltd.
Abstract: Many existing warning prioritization techniques seek to reorder static analysis warnings such that true positives are presented first. However, an excessive amount of time is still required to investigate and fix the prioritized warnings, because some are not actually true positives or are irrelevant to the code context and topic. In this paper, we propose a warning prioritization technique that reflects various latent topics from bug-related code blocks. Our main aim is to build a prioritization model that comprises separate warning priorities depending on the topic of the change sets, in order to identify the number of true positive warnings. For the performance evaluation of the proposed model, we employ a performance metric called the warning detection rate, widely used in warning prioritization studies, and compare the proposed model with other competitive techniques. Additionally, the effectiveness of our model is verified by applying our technique to eight industrial projects of a real global company.
Funding: supported by a National Research Foundation of Korea (NRF) (http://nrf.re.kr/eng/index) grant funded by the Korean government (RS-2023-00208278).
Abstract: Environmental, social, and governance (ESG) factors are critical in achieving sustainability in business management and are used as values aiming to enhance corporate value. Recently, non-financial indicators have been considered important for the actual valuation of corporations, so analyzing natural language data related to ESG is essential. Several previous studies limited their focus to specific countries or did not use big data. Past methodologies are insufficient for obtaining potential insights into the best practices to leverage ESG. To address this problem, the authors used data from two platforms: LexisNexis, a platform that provides media monitoring, and Web of Science, a platform that provides scientific papers. These big data were analyzed by topic modeling, which can derive hidden semantic structures within text. Through this process, it is possible to collect information on public and academic sentiment. The authors explored the data from a text-mining perspective using bidirectional encoder representations from transformers topic (BERTopic), a state-of-the-art topic-modeling technique. In addition, changes in subject patterns over time were considered using dynamic topic modeling. As a result, concepts proposed by international organizations such as the United Nations (UN) have been discussed in academia, and the media have formed a variety of agendas.
Funding: Project supported by the Anhui Provincial Natural Science Foundation of China (No. 1908085MF183), the National Natural Science Foundation of China (Nos. 62002084 and 61976005), the Training Program for Young and Middle-Aged Top Talents of Anhui Polytechnic University, China (No. 201812), the Zhejiang Provincial Natural Science Foundation of China (No. LQ21F020004), the State Key Laboratory for Novel Software Technology (Nanjing University) Research Program, China (No. KFKT2019B23), the Open Research Fund of the Anhui Key Laboratory of Detection Technology and Energy Saving Devices, Anhui Polytechnic University, China (No. DTESD2020B03), and the Stable Support Plan for Colleges and Universities in Shenzhen, China (No. GXWD20201230155427003-20200730101839009).
Abstract: Emerging topics in app reviews highlight the topics (e.g., software bugs) with which users are concerned during certain periods. Identifying emerging topics accurately and in a timely manner could help developers update apps more effectively. Methods for identifying emerging topics in app reviews based on topic models or clustering have been proposed in the literature. However, the accuracy of emerging topic identification is reduced because reviews are short and offer limited information. To solve this problem, an improved emerging topic identification (IETI) approach is proposed in this work. Specifically, we adopt natural language processing techniques to reduce noisy data and identify emerging topics in app reviews using the adaptive online biterm topic model. We then interpret the implicature of emerging topics through relevant phrases and sentences. We adopt the official app changelogs as ground truth and evaluate IETI on six common apps. The experimental results indicate that IETI is more accurate than the baseline in identifying emerging topics, with improvements in the F1 score of 0.126 for phrase labels and 0.061 for sentence labels. Finally, we release the code of IETI on GitHub (https://github.com/wanizhou/IETI).
Abstract: Recently, automation has been considered vital in most fields, since computing methods play a significant role in facilitating work such as automatic text summarization. Most of the computing methods used in real systems are based on graph models, which are characterized by their simplicity and stability. This paper therefore proposes an improved extractive text summarization algorithm based on both topic and graph models. The methodology of this work consists of two stages. First, the well-known TextRank algorithm is analyzed and its shortcomings are investigated. Then, an improved method is proposed with a new computational model of sentence weights. Experiments were carried out on the standard DUC2004 and DUC2006 datasets, comparing the proposed improved graph model algorithm TG-SMR (Topic Graph-Summarizer) with four other text summarization methods. The experimental results show that the proposed TG-SMR algorithm achieves higher ROUGE scores. It is foreseen that the TG-SMR algorithm will open a new horizon concerning the performance of ROUGE evaluation indicators.
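The baseline analyzed in the first stage, TextRank, ranks sentences by running PageRank-style power iteration over a sentence-similarity graph. The sketch below is the standard baseline only, not the paper's TG-SMR improvement; the similarity measure is the word-overlap formula from the original TextRank paper.

```python
import math

def similarity(s1, s2):
    """TextRank sentence similarity: shared-word count normalized
    by the log lengths of the two sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if len(w1) < 2 or len(w2) < 2:
        return 0.0
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))

def textrank(sentences, d=0.85, iters=50):
    """Score sentences by power iteration on the weighted similarity
    graph; higher-scoring sentences go into the extractive summary."""
    n = len(sentences)
    W = [[similarity(a, b) if i != j else 0.0
          for j, b in enumerate(sentences)]
         for i, a in enumerate(sentences)]
    out_sum = [sum(row) or 1.0 for row in W]  # avoid division by zero
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - d) + d * sum(W[j][i] / out_sum[j] * scores[j]
                                    for j in range(n))
                  for i in range(n)]
    return scores
```

A sentence connected to others by word overlap ends up with a higher score than an isolated one, which is the weakness TG-SMR addresses by adding topic-model weights to the sentence scores.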
Funding: supported by the National Social Science Fund of China (Youth Program): "A Study of the Acceptability of Chinese Government Public Signs in the New Era and the Countermeasures of the English Translation" (No. 13CYY010), the Subject Construction and Management Project of Zhejiang Gongshang University: "Research on the Organic Integration Path of Constructing Ideological and Political Training and Design of Mixed Teaching Platform during Epidemic Period" (No. XKJS2020007), and the Ministry of Education Industry-University Cooperative Education Program: "Research on the Construction of Cross-border Logistics Marketing Bilingual Course Integration" (No. 202102494002).
Abstract: Retelling extraction is an important branch of Natural Language Processing (NLP), and high-quality retelling resources are very helpful for improving the performance of machine translation. However, traditional methods based on bilingual parallel corpora often ignore the document background during retelling acquisition and application. To solve this problem, we introduce topic model information into the translation mode and propose a topic-based statistical machine translation method to improve translation performance. In this method, Probabilistic Latent Semantic Analysis (PLSA) is used to obtain the co-occurrence relationship between words and documents through hybrid matrix decomposition. We then design a decoder to simplify the decoding process. Experiments show that the proposed method can effectively improve translation accuracy.
Abstract: Globally, educational institutions reported a dramatic shift to online learning in an effort to contain the COVID-19 pandemic. The fundamental concern was the continuance of education, and several novel solutions were developed to address technical and pedagogical issues. However, these were not the only difficulties that students faced. The implemented solutions kept the educational process operating with little regard for students' changing circumstances, which obliged them to study from home. Students should be asked to provide a full list of their concerns. As a result, student reflections, including those from Saudi Arabia, have been analysed to identify obstacles encountered during the COVID-19 pandemic. However, most of the analyses relied on closed-ended questions, which limited student involvement. To delve into students' responses, this study used open-ended questions, a qualitative method (content analysis), a quantitative method (topic modelling), and sentiment analysis. The study also looked at students' emotional states during and after the COVID-19 pandemic. In terms of determining trends in students' input, the results showed that the quantitative and qualitative methods produced similar outcomes. Students had unfavourable sentiments about studying during COVID-19 and positive sentiments about face-to-face study. Furthermore, topic modelling revealed that the majority of difficulties related to the environment (home) and social life, and that students were less accepting of online learning. It is therefore possible to conclude that face-to-face study still attracts students and provides benefits that online study cannot, such as social interaction and effective eye-to-eye communication.
Abstract: This paper aims to develop Machine Learning algorithms to classify electronic articles related to COVID-19 by retrieving information and topic modelling. The methodology of this study is categorized into three phases: the Text Classification Approach (TCA), the Proposed Algorithms Interpretation (PAI), and finally, the Information Retrieval Approach (IRA). The TCA reflects the text preprocessing pipeline, called a clean corpus. The Global Vectors for Word Representation (GloVe) pre-trained model, FastText, Term Frequency-Inverse Document Frequency (TF-IDF), and Bag-of-Words (BOW) were used for extracting the features in this research. The PAI applies the Bidirectional Long Short-Term Memory (Bi-LSTM) and Convolutional Neural Network (CNN) models to classify the COVID-19 news. The IRA, in turn, presents the mathematical interpretation of Latent Dirichlet Allocation (LDA), used for modelling the topics in Information Retrieval (IR). In this study, 99% accuracy was obtained by performing K-fold cross-validation on Bi-LSTM with GloVe. A comparative analysis between Deep Learning and Machine Learning, based on feature extraction and computational complexity, was also performed. Furthermore, some text analyses and the most influential aspects of each document were explored. We utilized Bidirectional Encoder Representations from Transformers (BERT) as a Deep Learning mechanism in our model training, but the results were not satisfactory. Nevertheless, the proposed system is adaptable to real-time classification of COVID-19 news.
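Two of the feature extractors named above, BOW and TF-IDF, are straightforward to compute by hand. The sketch below is a minimal from-scratch version (real pipelines would use a library vectorizer); the example corpus is our own, not the paper's dataset.

```python
import math
from collections import Counter

def bow_and_tfidf(docs):
    """Bag-of-words counts and TF-IDF weights per document.
    tf is the raw count of the term in the document;
    idf(t) = log(N / df(t)), where df(t) is the number of
    documents containing t and N is the corpus size."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # each doc counts a term at most once
    bow = [Counter(toks) for toks in tokenized]
    tfidf = [{t: c * math.log(N / df[t]) for t, c in b.items()}
             for b in bow]
    return bow, tfidf
```

Terms that appear in every document get an IDF of zero, while rare terms are up-weighted, which is why TF-IDF features often separate topical articles better than raw BOW counts.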
Abstract: Purpose: This paper reports on a scientometric analysis, bolstered by human-in-the-loop domain experts, to examine the field of metal-organic frameworks (MOFs) research. Scientometric analyses reveal the intellectual landscape of a field. The study engaged MOF scientists in the design and review of our research workflow. MOF materials are an essential component in next-generation renewable energy storage and biomedical technologies. The research approach demonstrates how engaging experts, via human-in-the-loop processes, can help develop a comprehensive view of a field's research trends, influential works, and specialized topics. Design/methodology/approach: A scientometric analysis was conducted, integrating natural language processing (NLP), topic modeling, and network analysis methods. The analytical approach was enhanced through a human-in-the-loop iterative process involving MOF research scientists at selected intervals, and their feedback was incorporated into our method. The data sample included 65,209 MOF research articles. Python 3 and the software tool VOSviewer were used to perform the analysis. Findings: The findings demonstrate the value of including domain experts in research workflows, refinement, and interpretation of results. At each stage of the analysis, the MOF researchers contributed to interpreting the results and to method refinements targeting our focus on MOF research. This study identified influential works and their themes. Our findings also underscore four main MOF research directions and applications. Research limitations: This study is limited by the sample (articles identified and referenced by the Cambridge Structural Database) that informed our analysis. Practical implications: Our findings contribute to addressing the current gap in fully mapping out the comprehensive landscape of MOF research. Additionally, the results will help domain scientists target future research directions. Originality/value: To the best of our knowledge, the number of publications collected for analysis exceeds those of previous studies. This enabled us to explore a more extensive body of MOF research than previous studies. Another contribution of our work is the iterative engagement of domain scientists, who brought in-depth, expert interpretation to the data analysis, helping hone the study.
Funding: The National Natural Science Foundation of China (Nos. 71973023 and 42277493).
Abstract: Aiming to identify the policy topics and their evolutionary logic that enhance the digital and green development (dual development) of traditional manufacturing enterprises, to address weaknesses in current policies, and to provide resources for refining dual development policies, a total of 15,954 dual development-related policies issued by national and departmental authorities in China from January 2000 to August 2023 were analyzed. Based on topic modeling techniques and the policy modeling consistency (PMC) framework, the evolution of policy topics was visualized and a dynamic assessment of the policies was conducted. The results show that the digital and green development policy framework has been progressively refined, and the governance philosophy has shifted from a "regulatory government" paradigm to a "service-oriented government". The support pattern has evolved from "dispersed matching" to "integrated symbiosis". However, there are still significant deficiencies in departmental cooperation, balanced measures, coordinated links, and multi-stakeholder participation. Future policy improvements should therefore focus on guiding multi-stakeholder participation, enhancing public demand orientation, and addressing the entire value chain. These steps aim to create an open and shared digital industry ecosystem to promote the coordinated dual development of traditional manufacturing enterprises.
Abstract: Social media has revolutionized the dissemination of real-life information, serving as a robust platform for sharing life events. Twitter, characterized by its brevity and continuous flow of posts, has emerged as a crucial source for public health surveillance, offering valuable insights into public reactions during the COVID-19 pandemic. This study aims to leverage a range of machine learning techniques to extract pivotal themes and facilitate text classification on a dataset of COVID-19 outbreak-related tweets. Diverse topic modeling approaches were employed to extract pertinent themes and subsequently form a dataset for training text classification models. An assessment of coherence metrics revealed that the Gibbs Sampling Dirichlet Mixture Model (GSDMM), which utilizes trigram and bag-of-words (BOW) feature extraction, outperformed Non-negative Matrix Factorization (NMF), Latent Dirichlet Allocation (LDA), and a hybrid strategy involving Bidirectional Encoder Representations from Transformers (BERT) combined with LDA and K-means in pinpointing significant themes within the dataset. Among the models assessed for text clustering, the utilization of LDA, either as a clustering model or for feature extraction combined with BERT for K-means, resulted in higher coherence scores, consistent with human ratings, signifying their efficacy. In particular, LDA, notably in conjunction with trigram representation and BOW, demonstrated superior performance. This underscores the suitability of LDA for topic modeling, given its proficiency in capturing intricate textual relationships. In the context of text classification, models such as Linear Support Vector Classification (LSVC), Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), Convolutional Neural Network with BiLSTM (CNN-BiLSTM), and BERT showed outstanding performance, achieving accuracy and weighted F1 scores exceeding 80%. These results significantly surpassed other models, such as Multinomial Naive Bayes (MNB), Linear Support Vector Machine (LSVM), and Logistic Regression (LR), which achieved scores in the range of 60 to 70 percent.
Funding: financially supported by the National Natural Science Foundation of China (31971541).
Abstract: Forest habitats are critical for biodiversity, ecosystem services, human livelihoods, and well-being. The capacity to conduct theoretical and applied forest ecology research addressing direct (e.g., deforestation) and indirect (e.g., climate change) anthropogenic pressures has benefited considerably from new field and statistical techniques. We used machine learning and bibliometric structural topic modelling to identify 20 latent topics comprising four principal fields from a corpus of 16,952 forest ecology/forestry articles published in eight ecology and five forestry journals between 2010 and 2022. Articles published per year increased from 820 in 2010 to 2,354 in 2021, shifting toward more applied topics. Publications from China and some countries in North America and Europe dominated, with relatively fewer articles from some countries in West and Central Africa and West Asia, despite their globally important forest resources. Most study sites were in countries in North America, Central Asia, and South America, and in Australia. Articles utilizing R statistical software predominated, increasing from 29.5% in 2010 to 71.4% in 2022. The most frequently used packages included lme4, vegan, nlme, MuMIn, ggplot2, car, MASS, mgcv, multcomp, and raster. R was more often used in forest ecology than in applied forestry articles. R software offers advantages in script and workflow sharing compared to other statistical packages. Our findings demonstrate that the disciplines of forest ecology/forestry are expanding in both number and scope, aided by more sophisticated statistical tools, to tackle the challenges of redressing forest habitat loss and the socio-economic impacts of deforestation.
Funding: Project (50808025) supported by the National Natural Science Foundation of China, and Project (20090162110057) supported by the Doctoral Fund of the Ministry of Education, China
Abstract: Most research on anomaly detection has focused on events that differ from their spatial-temporal neighboring events. It remains a significant challenge to detect anomalies that involve multiple normal events interacting in an unusual pattern. In this work, a novel unsupervised method based on a sparse topic model is proposed to capture motion patterns and detect anomalies in traffic surveillance. Scale-invariant feature transform (SIFT) flow is used to improve the dense trajectory in order to extract interest points and the corresponding descriptors with less interference. To strengthen the relationship of interest points on the same trajectory, the Fisher kernel method is applied to obtain a representation of the trajectory, which is quantized into visual words. The sparse topic model then explores the latent motion patterns and achieves a sparse representation of the video scene. Finally, two anomaly detection algorithms are compared, based on video clip detection and visual word analysis respectively. Experiments were conducted on the QMUL Junction dataset and the AVSS dataset, and the results demonstrate the superior efficiency of the proposed method.
Funding: Supported by the National High Technology Research and Development Program of China (No. 2012AA011005)
Abstract: Topic models such as Latent Dirichlet Allocation (LDA) have been successfully applied to many text mining tasks for extracting the topics embedded in corpora. However, existing topic models generally cannot discover bursty topics that experience a sudden increase during a period of time. In this paper, we propose a new topic model named Burst-LDA, which simultaneously discovers topics and reveals their burstiness by explicitly modeling each topic's burst states with a first-order Markov chain and using the chain to generate the topic proportions of documents in a logistic normal fashion. A Gibbs sampling algorithm is developed for the posterior inference of the proposed model. Experimental results on a news dataset show that our model can efficiently discover bursty topics, outperforming the state-of-the-art method.
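For readers unfamiliar with Gibbs sampling in topic models, the sketch below shows the standard collapsed Gibbs sampler for plain LDA — the inference machinery that Burst-LDA extends; it does not include the paper's burst-state Markov chain or logistic-normal step, and the hyperparameters are arbitrary illustrative values.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K=2, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for vanilla LDA.
    docs: list of token lists. Returns per-document topic counts.
    Each token's topic is resampled from its full conditional,
    proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})           # vocabulary size
    ndk = [[0] * K for _ in docs]                   # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]      # topic-word counts
    nk = [0] * K                                    # topic totals
    z = []                                          # topic assignments
    for d, doc in enumerate(docs):                  # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                          # remove current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta)
                           / (nk[j] + V * beta) for j in range(K)]
                r = rng.random() * sum(weights)      # sample new topic
                for j, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = j
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk
```

Burst-LDA keeps this resampling loop but adds, for each topic, a chain of hidden burst states whose transitions modulate how strongly the topic appears in documents over time.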