This study is an exploratory analysis of applying natural language processing techniques such as Term Frequency-Inverse Document Frequency and Sentiment Analysis on Twitter data. The uniqueness of this work is establi...This study is an exploratory analysis of applying natural language processing techniques such as Term Frequency-Inverse Document Frequency and Sentiment Analysis on Twitter data. The uniqueness of this work is established by determining the overall sentiment of a politician’s tweets based on TF-IDF values of terms used in their published tweets. By calculating the TF-IDF value of terms from the corpus, this work displays the correlation between TF-IDF score and polarity. The results of this work show that calculating the TF-IDF score of the corpus allows for a more accurate representation of the overall polarity since terms are given a weight based on their uniqueness and relevance rather than just the frequency at which they appear in the corpus.展开更多
Individuals,local communities,environmental associations,private organizations,and public representatives and bodies may all be aggrieved by environmental problems concerning poor air quality,illegal waste disposal,wa...Individuals,local communities,environmental associations,private organizations,and public representatives and bodies may all be aggrieved by environmental problems concerning poor air quality,illegal waste disposal,water contamination,and general pollution.Environmental complaints represent the expressions of dissatisfaction with these issues.As the timeconsuming of managing a large number of complaints,text mining may be useful for automatically extracting information on stakeholder priorities and concerns.The paper used text mining and semantic network analysis to crawl relevant keywords about environmental complaints from two online complaint submission systems:online claim submission system of Regional Agency for Prevention,Environment and Energy(Arpae)(“Contact Arpae”);and Arpae's internal platform for environmental pollution(“Environmental incident reporting portal”)in the Emilia-Romagna Region,Italy.We evaluated the total of 2477 records and classified this information based on the claim topic(air pollution,water pollution,noise pollution,waste,odor,soil,weather-climate,sea-coast,and electromagnetic radiation)and geographical distribution.Then,this paper used natural language processing to extract keywords from the dataset,and classified keywords ranking higher in Term Frequency-Inverse Document Frequency(TF-IDF)based on the driver,pressure,state,impact,and response(DPSIR)framework.This study provided a systemic approach to understanding the interaction between people and environment in different geographical contexts and builds sustainable and healthy communities.The results showed that most complaints are from the public and associated with air pollution and odor.Factories(particularly foundries and ceramic industries)and farms are identified as the drivers of environmental issues.Citizen believed that environmental issues mainly affect human well-being.Moreover,the keywords of“odor”,“report”,“request”,“presence”,“municipality”,and“hours”were the most influential and meaningful concepts,as demonstrated by their high degree and betweenness centrality values.Keywords connecting odor(classified as impacts)and air pollution(classified as state)were the most important(such as“odor-burnt plastic”and“odor-acrid”).Complainants perceived odor annoyance as a primary environmental concern,possibly related to two main drivers:“odor-factory”and“odorsfarms”.The proposed approach has several theoretical and practical implications:text mining may quickly and efficiently address citizen needs,providing the basis toward automating(even partially)the complaint process;and the DPSIR framework might support the planning and organization of information and the identification of stakeholder concerns and priorities,as well as metrics and indicators for their assessment.Therefore,integration of the DPSIR framework with the text mining of environmental complaints might generate a comprehensive environmental knowledge base as a prerequisite for a wider exploitation of analysis to support decision-making processes and environmental management activities.展开更多
Laterally with the birth of the Internet,the fast growth of mobile stra-tegies has democratised content production owing to the widespread usage of social media,resulting in a detonation of short informal writings.Twi...Laterally with the birth of the Internet,the fast growth of mobile stra-tegies has democratised content production owing to the widespread usage of social media,resulting in a detonation of short informal writings.Twitter is micro-blogging short text and social networking services,with posted millions of quick messages.Twitter analysis addresses the topic of interpreting users’tweets in terms of ideas,interests,and views in a range of settings andfields.This type of study can be useful for a variation of academics and applications that need knowing people’s perspectives on a given topic or event.Although sentiment examination of these texts is useful for a variety of reasons,it is typically seen as a difficult undertaking due to the fact that these messages are frequently short,informal,loud,and rich in linguistic ambiguities such as polysemy.Furthermore,most contemporary sentiment analysis algorithms are based on clean data.In this paper,we offers a machine-learning-based sentiment analysis method that extracts features from Term Frequency and Inverse Document Frequency(TF-IDF)and needs to apply deep intelligent wordnet lemmatize to improve the excellence of tweets by removing noise.We also utilise the Random Forest network to detect the emotion of a tweet.To authenticate the proposed approach performance,we conduct extensive tests on publically accessible datasets,and thefindings reveal that the suggested technique significantly outperforms sentiment classification in multi-class emotion text data.展开更多
首先,数字服务平台数据分析利用Python爬虫技术采集广西数字服务平台的馆藏信息、图书信息、借阅信息等。其次,爬取豆瓣年度关注书籍的评论,运用后羿采集器对豆瓣图书信息进行采集,并将清洗后的数据通过Pandas和Matplotlib等可视化工具...首先,数字服务平台数据分析利用Python爬虫技术采集广西数字服务平台的馆藏信息、图书信息、借阅信息等。其次,爬取豆瓣年度关注书籍的评论,运用后羿采集器对豆瓣图书信息进行采集,并将清洗后的数据通过Pandas和Matplotlib等可视化工具进行可视化展示。最后,运用词频-逆文本频率指数(Term Frequency Inverse Document Frequency,TF-IDF)算法对评论进行分析,对广西数字图书馆和豆瓣图书等各类数据进行对比分析,得出更加符合读者需求的数据信息,便于优化数字图书借阅服务,为数字图书馆提供数据支撑,以便平台管理员能够快速、有效地对数据服务平台进行决策。展开更多
文本分类是信息检索的关键问题之一.提取更多的可信反例和构造准确高效的分类器是PU(positive and unlabeled)文本分类的两个重要问题.然而,在现有的可信反例提取方法中,很多方法提取的可信反例数量较少,构建的分类器质量有待提高.分别...文本分类是信息检索的关键问题之一.提取更多的可信反例和构造准确高效的分类器是PU(positive and unlabeled)文本分类的两个重要问题.然而,在现有的可信反例提取方法中,很多方法提取的可信反例数量较少,构建的分类器质量有待提高.分别针对这两个重要步骤提供了一种基于聚类的半监督主动分类方法.与传统的反例提取方法不同,利用聚类技术和正例文档应与反例文档共享尽可能少的特征项这一特点,从未标识数据集中尽可能多地移除正例,从而可以获得更多的可信反例.结合SVM主动学习和改进的Rocchio构建分类器,并采用改进的TFIDF(term frequency inverse document frequency)进行特征提取,可以显著提高分类的准确度.分别在3个不同的数据集中测试了分类结果(RCV1,Reuters-21578,20 Newsgoups).实验结果表明,基于聚类寻找可信反例可以在保持较低错误率的情况下获取更多的可信反例,而且主动学习方法的引入也显著提升了分类精度.展开更多
文摘This study is an exploratory analysis of applying natural language processing techniques such as Term Frequency-Inverse Document Frequency and Sentiment Analysis on Twitter data. The uniqueness of this work is established by determining the overall sentiment of a politician’s tweets based on TF-IDF values of terms used in their published tweets. By calculating the TF-IDF value of terms from the corpus, this work displays the correlation between TF-IDF score and polarity. The results of this work show that calculating the TF-IDF score of the corpus allows for a more accurate representation of the overall polarity since terms are given a weight based on their uniqueness and relevance rather than just the frequency at which they appear in the corpus.
文摘Individuals,local communities,environmental associations,private organizations,and public representatives and bodies may all be aggrieved by environmental problems concerning poor air quality,illegal waste disposal,water contamination,and general pollution.Environmental complaints represent the expressions of dissatisfaction with these issues.As the timeconsuming of managing a large number of complaints,text mining may be useful for automatically extracting information on stakeholder priorities and concerns.The paper used text mining and semantic network analysis to crawl relevant keywords about environmental complaints from two online complaint submission systems:online claim submission system of Regional Agency for Prevention,Environment and Energy(Arpae)(“Contact Arpae”);and Arpae's internal platform for environmental pollution(“Environmental incident reporting portal”)in the Emilia-Romagna Region,Italy.We evaluated the total of 2477 records and classified this information based on the claim topic(air pollution,water pollution,noise pollution,waste,odor,soil,weather-climate,sea-coast,and electromagnetic radiation)and geographical distribution.Then,this paper used natural language processing to extract keywords from the dataset,and classified keywords ranking higher in Term Frequency-Inverse Document Frequency(TF-IDF)based on the driver,pressure,state,impact,and response(DPSIR)framework.This study provided a systemic approach to understanding the interaction between people and environment in different geographical contexts and builds sustainable and healthy communities.The results showed that most complaints are from the public and associated with air pollution and odor.Factories(particularly foundries and ceramic industries)and farms are identified as the drivers of environmental issues.Citizen believed that environmental issues mainly affect human well-being.Moreover,the keywords of“odor”,“report”,“request”,“presence”,“municipality”,and“hours”were the most influential and meaningful concepts,as demonstrated by their high degree and betweenness centrality values.Keywords connecting odor(classified as impacts)and air pollution(classified as state)were the most important(such as“odor-burnt plastic”and“odor-acrid”).Complainants perceived odor annoyance as a primary environmental concern,possibly related to two main drivers:“odor-factory”and“odorsfarms”.The proposed approach has several theoretical and practical implications:text mining may quickly and efficiently address citizen needs,providing the basis toward automating(even partially)the complaint process;and the DPSIR framework might support the planning and organization of information and the identification of stakeholder concerns and priorities,as well as metrics and indicators for their assessment.Therefore,integration of the DPSIR framework with the text mining of environmental complaints might generate a comprehensive environmental knowledge base as a prerequisite for a wider exploitation of analysis to support decision-making processes and environmental management activities.
文摘Laterally with the birth of the Internet,the fast growth of mobile stra-tegies has democratised content production owing to the widespread usage of social media,resulting in a detonation of short informal writings.Twitter is micro-blogging short text and social networking services,with posted millions of quick messages.Twitter analysis addresses the topic of interpreting users’tweets in terms of ideas,interests,and views in a range of settings andfields.This type of study can be useful for a variation of academics and applications that need knowing people’s perspectives on a given topic or event.Although sentiment examination of these texts is useful for a variety of reasons,it is typically seen as a difficult undertaking due to the fact that these messages are frequently short,informal,loud,and rich in linguistic ambiguities such as polysemy.Furthermore,most contemporary sentiment analysis algorithms are based on clean data.In this paper,we offers a machine-learning-based sentiment analysis method that extracts features from Term Frequency and Inverse Document Frequency(TF-IDF)and needs to apply deep intelligent wordnet lemmatize to improve the excellence of tweets by removing noise.We also utilise the Random Forest network to detect the emotion of a tweet.To authenticate the proposed approach performance,we conduct extensive tests on publically accessible datasets,and thefindings reveal that the suggested technique significantly outperforms sentiment classification in multi-class emotion text data.
文摘首先,数字服务平台数据分析利用Python爬虫技术采集广西数字服务平台的馆藏信息、图书信息、借阅信息等。其次,爬取豆瓣年度关注书籍的评论,运用后羿采集器对豆瓣图书信息进行采集,并将清洗后的数据通过Pandas和Matplotlib等可视化工具进行可视化展示。最后,运用词频-逆文本频率指数(Term Frequency Inverse Document Frequency,TF-IDF)算法对评论进行分析,对广西数字图书馆和豆瓣图书等各类数据进行对比分析,得出更加符合读者需求的数据信息,便于优化数字图书借阅服务,为数字图书馆提供数据支撑,以便平台管理员能够快速、有效地对数据服务平台进行决策。
文摘文本分类是信息检索的关键问题之一.提取更多的可信反例和构造准确高效的分类器是PU(positive and unlabeled)文本分类的两个重要问题.然而,在现有的可信反例提取方法中,很多方法提取的可信反例数量较少,构建的分类器质量有待提高.分别针对这两个重要步骤提供了一种基于聚类的半监督主动分类方法.与传统的反例提取方法不同,利用聚类技术和正例文档应与反例文档共享尽可能少的特征项这一特点,从未标识数据集中尽可能多地移除正例,从而可以获得更多的可信反例.结合SVM主动学习和改进的Rocchio构建分类器,并采用改进的TFIDF(term frequency inverse document frequency)进行特征提取,可以显著提高分类的准确度.分别在3个不同的数据集中测试了分类结果(RCV1,Reuters-21578,20 Newsgoups).实验结果表明,基于聚类寻找可信反例可以在保持较低错误率的情况下获取更多的可信反例,而且主动学习方法的引入也显著提升了分类精度.