Dimensionality Reduction of Distributed Vector Word Representations and Emoticon Stemming for Sentiment Analysis (cited by: 3)

Abstract: Social media platforms such as Twitter and the Internet Movie Database (IMDb) contain vast amounts of data with applications in predictive sentiment analysis for movie sales, stock market fluctuations, brand opinion, or current events. Using a dataset taken from IMDb by Stanford, we identify some of the most significant phrases for identifying sentiment in a wide variety of movie reviews. Data from Twitter are especially attractive due to Twitter's real-time nature through its streaming API. Effectively analyzing these data in a streaming fashion requires efficient models, which may be improved by reducing the dimensionality of input vectors. One way this has been done in the past is by using emoticons; we propose a method for further reducing these features by identifying common structure in emoticons with similar sentiment. We also examine the gender distribution of emoticon usage, finding that the tendency to use certain emoticons is disproportionate between males and females. Despite the roughly equal gender distribution on Twitter, emoticon usage is predominantly female. Furthermore, we find that distributed vector representations, such as those produced by Word2Vec, may be reduced through feature selection. This analysis was done on a manually labeled sample of 1000 tweets from a new dataset, the Large Emoticon Corpus, which consisted of about 8.5 million tweets containing emoticons and was collected over a five-day period in May 2015. Additionally, using the common structure of similar emoticons, we are able to characterize positive and negative emoticons using two regular expressions which account for over 90% of emoticon usage in the Large Emoticon Corpus.
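
The emoticon stemming idea described in the abstract can be illustrated with a minimal sketch. The two patterns below are hypothetical stand-ins chosen for illustration (eyes, an optional nose, then one or more mouth characters); they are not the regular expressions from the paper, and the token names EMO_POS and EMO_NEG are assumptions as well. The point is only that many surface variants collapse into one feature per sentiment class, which shrinks the input vocabulary.

```python
import re

# Hypothetical patterns, NOT the paper's: eyes, optional nose, mouth(s).
POSITIVE = re.compile(r"[:;=]-?[)\]D]+")   # e.g. :)  :-))  ;D  =]
NEGATIVE = re.compile(r"[:=]-?[(\[]+")     # e.g. :(  :-((  =[

def stem_emoticons(text: str) -> str:
    """Replace each matched emoticon with a single sentiment token."""
    text = POSITIVE.sub(" EMO_POS ", text)
    text = NEGATIVE.sub(" EMO_NEG ", text)
    # Collapse the extra whitespace introduced by the substitutions.
    return " ".join(text.split())
```

Calling `stem_emoticons("great movie :-))")` maps the variant `:-))` onto the same token that `:)` would receive, so a downstream classifier sees one feature instead of many.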

Authors: Brian Dickinson, Michael Ganger, Wei Hu

Affiliation: Department of Computer Science

Source: Journal of Data Analysis and Information Processing, 2015, Issue 4, pp. 153-162 (10 pages)

Keywords: Natural Language; Emoticon; Twitter; Review

Classification: R73 [Medicine and Health: Oncology]
