Abstract: This study is an exploratory analysis of applying natural language processing techniques, such as Term Frequency-Inverse Document Frequency (TF-IDF) and sentiment analysis, to Twitter data. The novelty of this work lies in determining the overall sentiment of a politician’s tweets from the TF-IDF values of the terms used in those tweets. By calculating the TF-IDF value of each term in the corpus, the work shows the correlation between TF-IDF score and polarity. The results show that TF-IDF scoring gives a more accurate representation of overall polarity, because terms are weighted by their uniqueness and relevance rather than only by how often they appear in the corpus.
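The weighting idea in this abstract can be illustrated with a minimal sketch. The abstract does not state which TF-IDF variant or sentiment lexicon the authors used, so the formula below (relative term frequency times log(N/df)), the tiny lexicon, and the toy tweets are all assumptions made purely for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf-idf} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

def frequency_polarity(tokens, lexicon):
    """Mean polarity of lexicon terms, each occurrence counted equally."""
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

def weighted_polarity(doc_scores, lexicon):
    """Mean polarity of lexicon terms, weighted by their TF-IDF score."""
    weight_sum = sum(w for t, w in doc_scores.items() if t in lexicon)
    if weight_sum == 0:
        return 0.0
    total = sum(w * lexicon[t] for t, w in doc_scores.items() if t in lexicon)
    return total / weight_sum

# Hypothetical toy corpus and lexicon (not data from the study).
TWEETS = [
    ["great", "great", "terrible"],
    ["great", "jobs"],
    ["great", "plan"],
    ["tax", "plan"],
]
LEXICON = {"great": 1.0, "terrible": -1.0}

SCORES = tf_idf(TWEETS)
raw = frequency_polarity(TWEETS[0], LEXICON)
weighted = weighted_polarity(SCORES[0], LEXICON)
```

In this toy corpus the first tweet is positive when occurrences are simply counted, but negative once terms are weighted by TF-IDF, because "terrible" is rarer across the corpus than "great".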
Abstract: Individuals, local communities, environmental associations, private organizations, and public representatives and bodies may all be aggrieved by environmental problems concerning poor air quality, illegal waste disposal, water contamination, and general pollution. Environmental complaints are expressions of dissatisfaction with these issues. Because managing a large number of complaints is time-consuming, text mining may be useful for automatically extracting information on stakeholder priorities and concerns. The paper used text mining and semantic network analysis to extract relevant keywords about environmental complaints from two online complaint submission systems in the Emilia-Romagna Region, Italy: the online claim submission system of the Regional Agency for Prevention, Environment and Energy (Arpae) (“Contact Arpae”), and Arpae's internal platform for environmental pollution (“Environmental incident reporting portal”). We evaluated a total of 2477 records and classified them by claim topic (air pollution, water pollution, noise pollution, waste, odor, soil, weather-climate, sea-coast, and electromagnetic radiation) and geographical distribution. Natural language processing was then used to extract keywords from the dataset, and the keywords ranking high in Term Frequency-Inverse Document Frequency (TF-IDF) were classified according to the driver, pressure, state, impact, and response (DPSIR) framework. The study provides a systemic approach to understanding the interaction between people and the environment in different geographical contexts and to building sustainable and healthy communities. The results showed that most complaints came from the public and were associated with air pollution and odor. Factories (particularly foundries and ceramic industries) and farms were identified as the drivers of environmental issues. Citizens believed that environmental issues mainly affect human well-being. Moreover, the keywords “odor”, “report”, “request”, “presence”, “municipality”, and “hours” were the most influential and meaningful concepts, as demonstrated by their high degree and betweenness centrality values. Keywords connecting odor (classified as an impact) and air pollution (classified as a state) were the most important (such as “odor-burnt plastic” and “odor-acrid”). Complainants perceived odor annoyance as a primary environmental concern, possibly related to two main drivers: “odor-factory” and “odors-farms”. The proposed approach has several theoretical and practical implications: text mining may address citizen needs quickly and efficiently, providing a basis for automating (even partially) the complaint process; and the DPSIR framework might support the planning and organization of information and the identification of stakeholder concerns and priorities, as well as metrics and indicators for their assessment. Therefore, integrating the DPSIR framework with the text mining of environmental complaints might generate a comprehensive environmental knowledge base as a prerequisite for wider use of analysis to support decision-making processes and environmental management activities.
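As a rough illustration of how degree centrality singles out a hub keyword, the sketch below builds a tiny undirected keyword network from the four keyword pairs quoted in the abstract. The full network is not given in the abstract, treating "odor" and "odors" as one node is an assumption, and only degree centrality is shown (betweenness centrality is not computed here).

```python
from collections import defaultdict

def degree_centrality(edges):
    """Normalized degree centrality: number of neighbours / (n - 1)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

# Keyword pairs quoted in the abstract; everything else about the
# network (other nodes, edge weights) is unknown and omitted.
EDGES = [
    ("odor", "burnt plastic"),
    ("odor", "acrid"),
    ("odor", "factory"),
    ("odor", "farms"),  # assumption: "odors" merged with "odor"
]

centrality = degree_centrality(EDGES)
```

In this four-edge fragment, "odor" is connected to every other keyword and therefore has the highest degree centrality, which is in line with the abstract's description of it as one of the most influential concepts.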
Funding: Supported by the National Natural Science Foundation of China (60503020, 60503033, 60703086); the Opening Foundation of Jiangsu Key Laboratory of Computer Information Processing Technology in Soochow University (KJS0714); the Research Foundation of Nanjing University of Posts and Telecommunications (NY207052, NY207082); and the National Natural Science Foundation of Jiangsu (BK2006094).
Abstract: A new common-phrase scoring method is proposed, based on term frequency-inverse document frequency (TFIDF) and the independence of the phrase. Combining the two properties helps identify more reasonable common phrases, which improves clustering accuracy. An equation for measuring the independence of a phrase is also proposed. The new algorithm, which improves on the suffix tree clustering (STC) algorithm, is named improved suffix tree clustering (ISTC). To validate the proposed algorithm, a prototype system was implemented and used to cluster several groups of web search results obtained from the Google search engine. Experimental results show that the improved algorithm offers higher accuracy than traditional suffix tree clustering.
Abstract: Anomaly detection (AD) is an important aspect of various domains, and title insurance (TI) is no exception. Robotic process automation (RPA) is taking over manual tasks in TI business processes, but it has limitations without the support of artificial intelligence (AI) and machine learning (ML). With increasing data dimensionality and in composite population scenarios, the complexity of detecting anomalies increases, and AD in automated document management systems (ADMS) is the least explored domain. Deep learning, being the fastest-maturing technology, can be combined with traditional anomaly detectors to facilitate and improve RPA in TI. We present a hybrid model for AD that uses autoencoders (AE) and a one-class support vector machine (OSVM). In the present study, the OSVM receives input features representing real-time documents from the TI business, orchestrated and reduced in dimension by the AE. The results obtained from multiple experiments are comparable with those of traditional methods and within a business-acceptable range in terms of accuracy and performance.
Funding: Supported by the Natural Science Foundation of Zhejiang Province (No. GF20F020063) and the Fujian Province Young and Middle-Aged Teacher Education Research Project (No. JAT170480).
Abstract: Language disorder, a common manifestation of Alzheimer’s disease (AD), has attracted widespread attention in recent years. This paper uses a novel natural language processing (NLP) method, compared against recent deep learning techniques, to detect AD and explore lexical performance. The proposed approach has two stages. First, dialogue contents belonging to the same category are pooled, giving two categories. Second, the term frequency-inverse document frequency (TF-IDF) algorithm is used to extract keywords from the transcripts, and the similarity between the keywords and each group is calculated separately using cosine distance. Several deep learning methods are used for performance comparison. The best-performing keywords are also used to analyze AD patients’ lexical performance. In the Predictive Challenge of Alzheimer’s Disease held by iFlytek in 2019, the proposed AD diagnosis model achieved better performance in binary classification by adjusting the number of keywords. The model's F1 score improves considerably on the 75.4% baseline, and its training process is simple and efficient. Analysis of the model's keywords shows that AD patients use fewer nouns and verbs than normal controls. The paper thus proposes a computer-assisted AD diagnosis model for a small Chinese dataset, which offers a potential way to assist AD diagnosis and analyze lexical performance in clinical settings.
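The second stage described above (keep the top-ranked keywords per group, then compare a transcript with each group by cosine similarity) can be sketched as follows. This is only one plausible reading of the abstract: the keyword weights, the English tokens, the value of `K`, and the `classify` helper are hypothetical, and cosine similarity is used here, which equals one minus the cosine distance mentioned in the abstract.

```python
import math
from collections import Counter

def top_keywords(weights, k):
    """Keep the k highest-weighted terms of a {term: weight} mapping."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:k])

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def classify(tokens, group_keywords):
    """Return the group whose keyword vector is most similar to the transcript."""
    vector = dict(Counter(tokens))  # raw term counts of the transcript
    return max(
        group_keywords,
        key=lambda group: cosine_similarity(vector, group_keywords[group]),
    )

# Made-up keyword weights standing in for per-group TF-IDF scores.
GROUP_WEIGHTS = {
    "AD": {"thing": 0.9, "that": 0.7, "um": 0.6, "kitchen": 0.1},
    "control": {"kitchen": 0.8, "washing": 0.7, "cookie": 0.6, "that": 0.1},
}
K = 3  # number of keywords kept per group; the abstract tunes this value
GROUP_KEYWORDS = {g: top_keywords(w, K) for g, w in GROUP_WEIGHTS.items()}

label = classify("the boy takes a cookie in the kitchen".split(), GROUP_KEYWORDS)
```

Changing `K` changes which terms take part in the comparison, which is the knob the abstract refers to when it mentions adjusting the number of keywords.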