This study is an exploratory analysis of applying natural language processing techniques such as Term Frequency-Inverse Document Frequency and Sentiment Analysis on Twitter data. The uniqueness of this work is establi...This study is an exploratory analysis of applying natural language processing techniques such as Term Frequency-Inverse Document Frequency and Sentiment Analysis on Twitter data. The uniqueness of this work is established by determining the overall sentiment of a politician’s tweets based on TF-IDF values of terms used in their published tweets. By calculating the TF-IDF value of terms from the corpus, this work displays the correlation between TF-IDF score and polarity. The results of this work show that calculating the TF-IDF score of the corpus allows for a more accurate representation of the overall polarity since terms are given a weight based on their uniqueness and relevance rather than just the frequency at which they appear in the corpus.展开更多
Laterally with the birth of the Internet,the fast growth of mobile stra-tegies has democratised content production owing to the widespread usage of social media,resulting in a detonation of short informal writings.Twi...Laterally with the birth of the Internet,the fast growth of mobile stra-tegies has democratised content production owing to the widespread usage of social media,resulting in a detonation of short informal writings.Twitter is micro-blogging short text and social networking services,with posted millions of quick messages.Twitter analysis addresses the topic of interpreting users’tweets in terms of ideas,interests,and views in a range of settings andfields.This type of study can be useful for a variation of academics and applications that need knowing people’s perspectives on a given topic or event.Although sentiment examination of these texts is useful for a variety of reasons,it is typically seen as a difficult undertaking due to the fact that these messages are frequently short,informal,loud,and rich in linguistic ambiguities such as polysemy.Furthermore,most contemporary sentiment analysis algorithms are based on clean data.In this paper,we offers a machine-learning-based sentiment analysis method that extracts features from Term Frequency and Inverse Document Frequency(TF-IDF)and needs to apply deep intelligent wordnet lemmatize to improve the excellence of tweets by removing noise.We also utilise the Random Forest network to detect the emotion of a tweet.To authenticate the proposed approach performance,we conduct extensive tests on publically accessible datasets,and thefindings reveal that the suggested technique significantly outperforms sentiment classification in multi-class emotion text data.展开更多
首先,数字服务平台数据分析利用Python爬虫技术采集广西数字服务平台的馆藏信息、图书信息、借阅信息等。其次,爬取豆瓣年度关注书籍的评论,运用后羿采集器对豆瓣图书信息进行采集,并将清洗后的数据通过Pandas和Matplotlib等可视化工具...首先,数字服务平台数据分析利用Python爬虫技术采集广西数字服务平台的馆藏信息、图书信息、借阅信息等。其次,爬取豆瓣年度关注书籍的评论,运用后羿采集器对豆瓣图书信息进行采集,并将清洗后的数据通过Pandas和Matplotlib等可视化工具进行可视化展示。最后,运用词频-逆文本频率指数(Term Frequency Inverse Document Frequency,TF-IDF)算法对评论进行分析,对广西数字图书馆和豆瓣图书等各类数据进行对比分析,得出更加符合读者需求的数据信息,便于优化数字图书借阅服务,为数字图书馆提供数据支撑,以便平台管理员能够快速、有效地对数据服务平台进行决策。展开更多
文本分类是信息检索的关键问题之一.提取更多的可信反例和构造准确高效的分类器是PU(positive and unlabeled)文本分类的两个重要问题.然而,在现有的可信反例提取方法中,很多方法提取的可信反例数量较少,构建的分类器质量有待提高.分别...文本分类是信息检索的关键问题之一.提取更多的可信反例和构造准确高效的分类器是PU(positive and unlabeled)文本分类的两个重要问题.然而,在现有的可信反例提取方法中,很多方法提取的可信反例数量较少,构建的分类器质量有待提高.分别针对这两个重要步骤提供了一种基于聚类的半监督主动分类方法.与传统的反例提取方法不同,利用聚类技术和正例文档应与反例文档共享尽可能少的特征项这一特点,从未标识数据集中尽可能多地移除正例,从而可以获得更多的可信反例.结合SVM主动学习和改进的Rocchio构建分类器,并采用改进的TFIDF(term frequency inverse document frequency)进行特征提取,可以显著提高分类的准确度.分别在3个不同的数据集中测试了分类结果(RCV1,Reuters-21578,20 Newsgoups).实验结果表明,基于聚类寻找可信反例可以在保持较低错误率的情况下获取更多的可信反例,而且主动学习方法的引入也显著提升了分类精度.展开更多
文摘This study is an exploratory analysis of applying natural language processing techniques such as Term Frequency-Inverse Document Frequency and Sentiment Analysis on Twitter data. The uniqueness of this work is established by determining the overall sentiment of a politician’s tweets based on TF-IDF values of terms used in their published tweets. By calculating the TF-IDF value of terms from the corpus, this work displays the correlation between TF-IDF score and polarity. The results of this work show that calculating the TF-IDF score of the corpus allows for a more accurate representation of the overall polarity since terms are given a weight based on their uniqueness and relevance rather than just the frequency at which they appear in the corpus.
文摘Laterally with the birth of the Internet,the fast growth of mobile stra-tegies has democratised content production owing to the widespread usage of social media,resulting in a detonation of short informal writings.Twitter is micro-blogging short text and social networking services,with posted millions of quick messages.Twitter analysis addresses the topic of interpreting users’tweets in terms of ideas,interests,and views in a range of settings andfields.This type of study can be useful for a variation of academics and applications that need knowing people’s perspectives on a given topic or event.Although sentiment examination of these texts is useful for a variety of reasons,it is typically seen as a difficult undertaking due to the fact that these messages are frequently short,informal,loud,and rich in linguistic ambiguities such as polysemy.Furthermore,most contemporary sentiment analysis algorithms are based on clean data.In this paper,we offers a machine-learning-based sentiment analysis method that extracts features from Term Frequency and Inverse Document Frequency(TF-IDF)and needs to apply deep intelligent wordnet lemmatize to improve the excellence of tweets by removing noise.We also utilise the Random Forest network to detect the emotion of a tweet.To authenticate the proposed approach performance,we conduct extensive tests on publically accessible datasets,and thefindings reveal that the suggested technique significantly outperforms sentiment classification in multi-class emotion text data.
文摘首先,数字服务平台数据分析利用Python爬虫技术采集广西数字服务平台的馆藏信息、图书信息、借阅信息等。其次,爬取豆瓣年度关注书籍的评论,运用后羿采集器对豆瓣图书信息进行采集,并将清洗后的数据通过Pandas和Matplotlib等可视化工具进行可视化展示。最后,运用词频-逆文本频率指数(Term Frequency Inverse Document Frequency,TF-IDF)算法对评论进行分析,对广西数字图书馆和豆瓣图书等各类数据进行对比分析,得出更加符合读者需求的数据信息,便于优化数字图书借阅服务,为数字图书馆提供数据支撑,以便平台管理员能够快速、有效地对数据服务平台进行决策。
文摘文本分类是信息检索的关键问题之一.提取更多的可信反例和构造准确高效的分类器是PU(positive and unlabeled)文本分类的两个重要问题.然而,在现有的可信反例提取方法中,很多方法提取的可信反例数量较少,构建的分类器质量有待提高.分别针对这两个重要步骤提供了一种基于聚类的半监督主动分类方法.与传统的反例提取方法不同,利用聚类技术和正例文档应与反例文档共享尽可能少的特征项这一特点,从未标识数据集中尽可能多地移除正例,从而可以获得更多的可信反例.结合SVM主动学习和改进的Rocchio构建分类器,并采用改进的TFIDF(term frequency inverse document frequency)进行特征提取,可以显著提高分类的准确度.分别在3个不同的数据集中测试了分类结果(RCV1,Reuters-21578,20 Newsgoups).实验结果表明,基于聚类寻找可信反例可以在保持较低错误率的情况下获取更多的可信反例,而且主动学习方法的引入也显著提升了分类精度.