
Bi-LSTM Based Text Classification Model for Imbalanced Samples
Abstract  The task of sentiment classification usually divides samples with emotional tendencies into two categories: positive and negative. Most theoretical models assume that the number of samples in the two categories is balanced, whereas in practice the two categories are generally imbalanced. To address this problem, this paper proposes a Bi-LSTM neural network model based on Focal loss for sentiment classification of imbalanced sample data. First, 24,190 travel reviews were collected and labeled as the dataset for the proposed model, in which positive samples far outnumber negative samples. To achieve better classification results, the dataset is first divided into core and non-core samples, and the non-core samples are removed to improve data quality; next, the data are trained with the Bi-LSTM neural network model based on Focal loss; finally, the model is validated on the test set to obtain the final classification results. Four evaluation metrics (accuracy, F1, recall, and specificity) are used to judge the merits of the model. A series of experiments shows that the Bi-LSTM neural network model based on Focal loss better addresses the problem of sample imbalance and achieves better classification performance than the traditional LSTM classification method.
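The paper itself does not include code; the following is a minimal PyTorch sketch of the kind of model the abstract describes, a bidirectional LSTM classifier trained with a binary focal loss. The hyperparameters (embedding size, hidden size, alpha, gamma) are illustrative assumptions, not values reported by the authors.

```python
import torch
import torch.nn as nn

class FocalLoss(nn.Module):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).
    Down-weights easy, well-classified examples so the minority class
    contributes relatively more to the gradient."""
    def __init__(self, alpha=0.25, gamma=2.0):  # illustrative defaults
        super().__init__()
        self.alpha, self.gamma = alpha, gamma

    def forward(self, logits, targets):
        bce = nn.functional.binary_cross_entropy_with_logits(
            logits, targets, reduction="none")            # -log(p_t) per sample
        p = torch.sigmoid(logits)
        p_t = p * targets + (1 - p) * (1 - targets)        # prob. of the true class
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** self.gamma * bce).mean()

class BiLSTMClassifier(nn.Module):
    """Embedding -> bidirectional LSTM -> linear head producing one logit."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, 1)

    def forward(self, token_ids):
        x = self.embed(token_ids)                  # (batch, seq, embed_dim)
        _, (h_n, _) = self.lstm(x)                 # h_n: (2, batch, hidden_dim)
        h = torch.cat([h_n[0], h_n[1]], dim=1)     # final forward + backward states
        return self.fc(h).squeeze(1)               # (batch,) logits
```

Training with this sketch simply replaces the usual nn.BCEWithLogitsLoss with FocalLoss(), e.g. loss = FocalLoss()(model(token_ids), labels.float()); the (1 - p_t)**gamma factor is what suppresses the loss of easy majority-class samples and so counters the class imbalance.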
Authors: 王欣羽, 李薇
Affiliation: School of Science, Yanshan University
Source: Computer Science and Application (《计算机科学与应用》), 2023, No. 11, pp. 1989-1999 (11 pages)
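The abstract evaluates the model with accuracy, F1, recall, and specificity. As a small illustration (not taken from the paper), all four can be computed from a binary confusion matrix as follows, assuming scikit-learn and labels with 1 for positive reviews:

```python
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, recall_score

def evaluate(y_true, y_pred):
    """Accuracy, F1 and recall for the positive class, plus specificity
    (recall of the negative class), which sklearn does not expose directly."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy":    accuracy_score(y_true, y_pred),
        "f1":          f1_score(y_true, y_pred),
        "recall":      recall_score(y_true, y_pred),          # TP / (TP + FN)
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,   # TN / (TN + FP)
    }

print(evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Since negative reviews are the minority class in this dataset, specificity (the recall of the negative class) is the metric most directly affected by the imbalance, which is presumably why it is reported alongside the standard three.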
