10 articles found
Sentiment Analysis on Twitter Data Using Term Frequency-Inverse Document Frequency
1
Authors: Akash Addiga, Sikha Bagui. Journal of Computer and Communications, 2022, Issue 8, pp. 117-128 (12 pages)
This study is an exploratory analysis of applying natural language processing techniques such as Term Frequency-Inverse Document Frequency and Sentiment Analysis on Twitter data. The uniqueness of this work is established by determining the overall sentiment of a politician’s tweets based on TF-IDF values of terms used in their published tweets. By calculating the TF-IDF value of terms from the corpus, this work displays the correlation between TF-IDF score and polarity. The results of this work show that calculating the TF-IDF score of the corpus allows for a more accurate representation of the overall polarity since terms are given a weight based on their uniqueness and relevance rather than just the frequency at which they appear in the corpus.
Keywords: Sentiment Analysis; Twitter Data; Term Frequency; Inverse Term Frequency; Term Frequency-Inverse Document Frequency (TF-IDF); Social Media
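As a rough illustration of the weighting this abstract describes, the following Python sketch computes TF-IDF scores over a toy corpus. The corpus, the function name `tf_idf`, and the specific variant (term count divided by document length, multiplied by the natural log of N/df, with no smoothing) are assumptions for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: score} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(corpus)
    tokenized = [doc.lower().split() for doc in corpus]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy "tweets", not data from the study.
tweets = [
    "we will vote for jobs",
    "we will cut taxes",
    "we will protect jobs and jobs",
]
weights = tf_idf(tweets)
```

In this toy corpus, words that occur in every document ("we", "will") get a score of zero, while a term that is both repeated and less widespread ("jobs" in the third document) scores higher, which is the sense in which the weight reflects uniqueness rather than raw frequency.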
A topic-word extraction algorithm based on an improved TF-IDF algorithm and co-occurring words (Cited by: 17)
2
Authors: 公冶小燕, 林培光, 任威隆, 张晨, 张春云. 南京大学学报(自然科学版) [Journal of Nanjing University (Natural Sciences)] [CAS, CSCD, PKU Core], 2017, Issue 6, pp. 1072-1080 (9 pages)
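Extracting the topics of information is a basic task for quickly locating user needs. Topic-word extraction faces three main problems: computing word weights, measuring the relationships between words, and the curse of dimensionality. To compute word weights, the method first uses mutual information to identify co-occurring word pairs and combines this nonlinearly with term frequency, part of speech, and word position. It then builds a document-by-co-occurring-word matrix from the word weights and constructs a Latent Semantic Analysis (LSA) model. Using the Singular Value Decomposition (SVD) of the LSA model, the method maps the document-by-co-occurring-word matrix into a latent semantic space, which both reduces the dimensionality of the data and yields a low-dimensional document similarity matrix. Finally, k-means clustering is applied to the document similarity matrix, and within each cluster the few co-occurring word pairs with the largest weights are selected as the topic words for that cluster. Compared with topic-word extraction based on TF-IDF (Term Frequency-Inverse Document Frequency) and on co-occurring words, the algorithm improves accuracy by 19% and 10%, respectively.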
Keywords: co-occurring words; mutual information; Latent Semantic Analysis (LSA); Singular Value Decomposition (SVD); Term Frequency-Inverse Document Frequency (TF-IDF)
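The first step of this method, identifying co-occurring word pairs with mutual information, can be sketched as follows. This is a minimal illustration only: the document-level probability estimates, the function name `cooccurrence_pmi`, and the toy documents are assumptions. The nonlinear combination with term frequency, part of speech, and position is not specified in the abstract and is not attempted here.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_pmi(documents):
    """Score word pairs by pointwise mutual information (PMI).

    Probabilities are estimated at the document level (an assumption):
    p(w) = df(w) / N and p(a, b) = df(a, b) / N.
    """
    n_docs = len(documents)
    word_df = Counter()
    pair_df = Counter()
    for tokens in documents:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        # Pairs are stored in alphabetical order, e.g. ("market", "stock").
        pair_df.update(combinations(vocab, 2))
    return {
        (a, b): math.log(
            (joint / n_docs) / ((word_df[a] / n_docs) * (word_df[b] / n_docs))
        )
        for (a, b), joint in pair_df.items()
    }


# Hypothetical pre-tokenized documents.
docs = [
    ["neural", "network", "training"],
    ["neural", "network", "pruning"],
    ["stock", "market", "training"],
    ["stock", "market", "crash"],
]
pmi = cooccurrence_pmi(docs)
```

Pairs that always appear together, such as ("market", "stock"), score above zero, while pairs that co-occur only as often as chance would predict, such as ("market", "training"), score zero. Plain PMI also rewards rare words, which may be one reason the method combines it with other signals.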
An improved TF-IDF approach for text classification (Cited by: 5)
3
Authors: 张云涛, 龚玲, 王永成. Journal of Zhejiang University-Science A (Applied Physics & Engineering) [SCIE, EI, CAS, CSCD], 2005, Issue 1, pp. 49-55 (7 pages)
This paper presents a new improved term frequency/inverse document frequency (TF-IDF) approach which uses confidence, support and characteristic words to enhance the recall and precision of text classification. Synonyms defined by a lexicon are processed in the improved TF-IDF approach. We discuss and analyze in detail the relationship among confidence, recall and precision. Experiments on science and technology texts gave promising results: the new TF-IDF approach improves the precision and recall of text classification compared with the conventional TF-IDF approach.
Keywords: term frequency/inverse document frequency (TF-IDF); text classification; confidence; support; characteristic words
Enhanced Topic-Aware Summarization Using Statistical Graph Neural Networks
4
Authors: Ayesha Khaliq, Salman Afsar Awan, Fahad Ahmad, Muhammad Azam Zia, Muhammad Zafar Iqbal. Computers, Materials & Continua [SCIE, EI], 2024, Issue 8, pp. 3221-3242 (22 pages)
The rapid expansion of online content and big data has precipitated an urgent need for efficient summarization techniques to swiftly comprehend vast textual documents without compromising their original integrity. Current approaches in Extractive Text Summarization (ETS) leverage the modeling of inter-sentence relationships, a task of paramount importance in producing coherent summaries. This study introduces an innovative model that integrates Graph Attention Networks (GATs) with Transformer-based Bidirectional Encoder Representations from Transformers (BERT) and Latent Dirichlet Allocation (LDA), further enhanced by Term Frequency-Inverse Document Frequency (TF-IDF) values, to improve sentence selection by capturing comprehensive topical information. Our approach constructs a graph with nodes representing sentences, words, and topics, thereby elevating the interconnectivity and enabling a more refined understanding of text structures. This model is extended from Single-Document Summarization to Multi-Document Summarization (MDS), offering significant improvements over existing models such as THGS-GMM and Topic-GraphSum, as demonstrated by empirical evaluations on benchmark news datasets like Cable News Network (CNN)/Daily Mail (DM) and Multi-News. The results consistently demonstrate superior performance, showcasing the model’s robustness in handling complex summarization tasks across single and multi-document contexts. This research not only advances the integration of BERT and LDA within GATs but also emphasizes our model’s capacity to effectively manage global information and adapt to diverse summarization challenges.
Keywords: summarization; graph attention network; bidirectional encoder representations from transformers; Latent Dirichlet Allocation; term frequency-inverse document frequency
ISTC: A New Method for Clustering Search Results (Cited by: 2)
5
Authors: ZHANG Wei, XU Baowen, ZHANG Weifeng, XU Junling. Wuhan University Journal of Natural Sciences [CAS], 2008, Issue 4, pp. 501-504 (4 pages)
A new common phrase scoring method is proposed based on the term frequency-inverse document frequency (TFIDF) and the independence of the phrase. Combining the two properties helps identify more reasonable common phrases, which improves the accuracy of clustering. An equation to measure the independence of a phrase is also proposed in this paper. The new algorithm, which improves the suffix tree clustering (STC) algorithm, is named improved suffix tree clustering (ISTC). To validate the proposed algorithm, a prototype system is implemented and used to cluster several groups of web search results obtained from the Google search engine. Experimental results show that the improved algorithm offers higher accuracy than traditional suffix tree clustering.
Keywords: Web search results clustering; suffix tree; term frequency-inverse document frequency (TFIDF); independence of phrases
Environmental complaint insights through text mining based on the driver, pressure, state, impact, and response (DPSIR) framework: Evidence from an Italian environmental agency
6
Authors: Fabiana MANSERVISI, Michele BANZI, Tomaso TONELLI, Paolo VERONESI, Susanna RICCI, Damiano DISTANTE, Stefano FARALLI, Giuseppe BORTONE. Regional Sustainability, 2023, Issue 3, pp. 261-281 (21 pages)
Individuals, local communities, environmental associations, private organizations, and public representatives and bodies may all be aggrieved by environmental problems concerning poor air quality, illegal waste disposal, water contamination, and general pollution. Environmental complaints represent the expressions of dissatisfaction with these issues. Because managing a large number of complaints is time-consuming, text mining may be useful for automatically extracting information on stakeholder priorities and concerns. The paper used text mining and semantic network analysis to crawl relevant keywords about environmental complaints from two online complaint submission systems: the online claim submission system of the Regional Agency for Prevention, Environment and Energy (Arpae) (“Contact Arpae”); and Arpae's internal platform for environmental pollution (“Environmental incident reporting portal”) in the Emilia-Romagna Region, Italy. We evaluated a total of 2477 records and classified this information based on the claim topic (air pollution, water pollution, noise pollution, waste, odor, soil, weather-climate, sea-coast, and electromagnetic radiation) and geographical distribution. Then, this paper used natural language processing to extract keywords from the dataset, and classified the keywords ranking higher in Term Frequency-Inverse Document Frequency (TF-IDF) based on the driver, pressure, state, impact, and response (DPSIR) framework. This study provided a systemic approach to understanding the interaction between people and environment in different geographical contexts and to building sustainable and healthy communities. The results showed that most complaints are from the public and associated with air pollution and odor. Factories (particularly foundries and ceramic industries) and farms are identified as the drivers of environmental issues. Citizens believed that environmental issues mainly affect human well-being. Moreover, the keywords “odor”, “report”, “request”, “presence”, “municipality”, and “hours” were the most influential and meaningful concepts, as demonstrated by their high degree and betweenness centrality values. Keywords connecting odor (classified as impacts) and air pollution (classified as state) were the most important (such as “odor-burnt plastic” and “odor-acrid”). Complainants perceived odor annoyance as a primary environmental concern, possibly related to two main drivers: “odor-factory” and “odors-farms”. The proposed approach has several theoretical and practical implications: text mining may quickly and efficiently address citizen needs, providing a basis for automating (even partially) the complaint process; and the DPSIR framework might support the planning and organization of information and the identification of stakeholder concerns and priorities, as well as metrics and indicators for their assessment. Therefore, integration of the DPSIR framework with the text mining of environmental complaints might generate a comprehensive environmental knowledge base as a prerequisite for a wider exploitation of analysis to support decision-making processes and environmental management activities.
Keywords: environmental complaints; text mining approach; term frequency-inverse document frequency (TF-IDF); driver, pressure, state, impact, and response (DPSIR) framework; semantic network analysis; Regional Agency for Prevention, Environment and Energy (Arpae)
A text classification method based on weighted word vectors on the Spark platform
7
Authors: 蔡宇翔, 王佳斌, 郑天华. 现代计算机 [Modern Computer], 2022, Issue 3, pp. 25-30 (6 pages)
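To address the low classification accuracy caused by inadequate text representation in text classification on the Spark platform, this paper combines the TF-IDF algorithm and the Word2vec model in SparkML and proposes a SparkML-based weighted word vector text representation. The text is first preprocessed with word segmentation and stop-word removal; the term frequency and inverse document frequency of each word are then computed with SparkML, along with the word vector of each word. The TF-IDF value of each word is used as the weight of its word vector, the text is represented as a weighted word vector, and an SVM classifier then performs the classification. Experiments on the THUNews dataset show that, compared with the traditional TF-IDF algorithm and with the averaged Word2Vec word vector representation, the method improves classification accuracy.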
Keywords: Spark; text classification; TF-IDF (term frequency-inverse document frequency); Word2Vec; support vector machine; text representation
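The weighting step described in this abstract can be sketched in plain Python as below. This is not the SparkML pipeline used in the paper: the function name `weighted_doc_vector`, the tiny 2-dimensional embeddings, the IDF values, and the choice to normalize by the sum of the weights are all assumptions made for illustration.

```python
from collections import Counter


def weighted_doc_vector(tokens, idf, embeddings):
    """Combine word vectors into one document vector, weighting each
    word by its TF-IDF value (normalizing by the weight sum is assumed)."""
    counts = Counter(t for t in tokens if t in embeddings and t in idf)
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for term, count in counts.items():
        weight = (count / len(tokens)) * idf[term]  # TF * IDF
        weight_sum += weight
        for i, value in enumerate(embeddings[term]):
            total[i] += weight * value
    if weight_sum == 0.0:
        return total  # no informative words: zero vector
    return [value / weight_sum for value in total]


# Hypothetical embeddings and IDF values, not trained on any dataset.
embeddings = {"match": [1.0, 0.0], "goal": [0.8, 0.2], "the": [0.0, 1.0]}
idf = {"match": 1.2, "goal": 1.2, "the": 0.0}
vector = weighted_doc_vector(["the", "match", "the", "goal"], idf, embeddings)
```

Here the frequent but uninformative word "the" has an IDF of zero and contributes nothing, so the document vector is driven by "match" and "goal". A plain average of the word vectors would instead be pulled toward "the", which is the contrast with averaged Word2Vec that the abstract draws.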
Hybrid Approach to Document Anomaly Detection: An Application to Facilitate RPA in Title Insurance
8
Authors: Abhijit Guha, Debabrata Samanta. International Journal of Automation and Computing [EI, CSCD], 2021, Issue 1, pp. 55-72 (18 pages)
Anomaly detection (AD) is an important aspect of various domains, and title insurance (TI) is no exception. Robotic process automation (RPA) is taking over manual tasks in TI business processes, but it has its limitations without the support of artificial intelligence (AI) and machine learning (ML). With increasing data dimensionality and in composite population scenarios, the complexity of detecting anomalies increases, and AD in automated document management systems (ADMS) is the least explored domain. Deep learning, being the fastest-maturing technology, can be combined with traditional anomaly detectors to facilitate and improve RPA in TI. We present a hybrid model for AD, using autoencoders (AE) and a one-class support vector machine (OSVM). In the present study, OSVM receives input features representing real-time documents from the TI business, orchestrated and with dimensions reduced by AE. The results obtained from multiple experiments are comparable with traditional methods and within a business-acceptable range regarding accuracy and performance.
Keywords: anomaly detection; title insurance; autoencoder; one-class support vector machine (OSVM); term frequency-inverse document frequency (TF-IDF); robotic process automation; dimensionality reduction
Spontaneous Language Analysis in Alzheimer’s Disease: Evaluation of Natural Language Processing Technique for Analyzing Lexical Performance
9
Authors: Liu Ning, Yuan Zhenming. Journal of Shanghai Jiaotong University (Science) [EI], 2022, Issue 2, pp. 160-167 (8 pages)
Language disorder, a common manifestation of Alzheimer’s disease (AD), has attracted widespread attention in recent years. This paper uses a novel natural language processing (NLP) method, compared with the latest deep learning technology, to detect AD and explore lexical performance. Our proposed approach has two stages. First, the dialogue contents are summarized into two categories with the same category. Second, the term frequency-inverse document frequency (TF-IDF) algorithm is used to extract the keywords of the transcripts, and the similarity of keywords between the groups is calculated separately by cosine distance. Several deep learning methods are used to compare the performance. Meanwhile, the keywords with the best performance are used to analyze AD patients’ lexical performance. In the Predictive Challenge of Alzheimer’s Disease held by iFlytek in 2019, the proposed AD diagnosis model achieves better performance in binary classification by adjusting the number of keywords. The F1 score of the model improves considerably over the baseline of 75.4%, and its training process is simple and efficient. We analyze the keywords of the model and find that AD patients use fewer nouns and verbs than normal controls. This paper proposes a computer-assisted AD diagnosis model on a small Chinese dataset, which provides a potential way of assisting the diagnosis of AD and analyzing lexical performance in clinical settings.
Keywords: natural language processing (NLP); Alzheimer's disease (AD); mild cognitive impairment; term frequency-inverse document frequency (TF-IDF); bag of words
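The keyword comparison step mentioned in this abstract relies on cosine similarity between keyword weight vectors. A minimal Python sketch follows; the function name `cosine_similarity`, the sparse dictionary representation, and the keyword weights are hypothetical, and assigning a transcript to the more similar group is an assumed decision rule, since the abstract does not spell one out.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical TF-IDF keyword weights for two groups and one transcript.
group_ad = {"thing": 0.9, "that": 0.7, "kitchen": 0.1}
group_control = {"kitchen": 0.8, "cookie": 0.7, "stool": 0.6}
transcript = {"thing": 0.5, "that": 0.4, "cookie": 0.1}

# Assumed rule: label the transcript with the more similar group.
label = max(
    [("AD", group_ad), ("control", group_control)],
    key=lambda item: cosine_similarity(transcript, item[1]),
)[0]
```

Varying how many top-weighted keywords are kept in each dictionary corresponds to the adjustment of the number of keywords that the abstract reports.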
Fusion Model for Tentative Diagnosis Inference Based on Clinical Narratives
10
Authors: Ying Yu, Junwen Duan, Min Li. Tsinghua Science and Technology [SCIE, EI, CAS, CSCD], 2023, Issue 4, pp. 686-695 (10 pages)
In general, physicians make a preliminary diagnosis based on patients’ admission narratives and admission conditions, largely depending on their experience and professional knowledge. An automatic and accurate tentative diagnosis based on clinical narratives would be of great importance to physicians, particularly when medical resources are scarce. Despite its great value, little work has been conducted on this diagnosis method. Thus, in this study, we propose a fusion model that integrates the semantic and symptom features contained in the clinical text. The semantic features of the input text are initially captured by an attention-based Bidirectional Long Short-Term Memory (BiLSTM) network. The symptom concepts, recognized from the input text, are then vectorized by using the term frequency-inverse document frequency method based on the relations between symptoms and diseases. Finally, two fusion strategies are utilized to recommend the most likely candidate International Classification of Diseases code. Model training and evaluation are performed on a public clinical dataset. The results show that both fusion strategies achieved promising performance, with the best result reaching a top-3 accuracy of 0.7412.
Keywords: tentative diagnosis; clinical narrative; Bidirectional Long Short-Term Memory (BiLSTM); term frequency-inverse document frequency (TF-IDF); fusion strategy