多层次数据增强的半监督中文情感分析方法 (cited by: 8)

A Semi-Supervised Sentiment Analysis Method for Chinese Based on Multi-Level Data Augmentation
Abstract [Objective] High-quality labeled data is hard to obtain in natural language processing. To address this problem, this paper designs a semi-supervised sentiment analysis method for Chinese based on multi-level data augmentation. [Methods] Simple data augmentation and back-translation are used as text augmentation techniques to obtain a large amount of unlabeled data, and a consistency regularization term computed on the unlabeled data extracts its data signal. A pseudo-label is predicted for each weakly augmented sample and is combined with the strongly augmented sample to build a supervised training signal; filtering by a confidence threshold leads the model to produce high-confidence predictions. [Results] Experiments were run on three public sentiment analysis datasets. With only 1,000 labeled documents, the method improves on BERT by 2.311% on the Waimai dataset and by 6.726% on the Weibo dataset. [Limitations] All experiments used public general-domain corpora; the method was not evaluated on vertical-domain datasets. [Conclusions] The proposed method makes full use of the information in unlabeled data, alleviates the difficulty of obtaining labeled data, and shows strong prediction stability.
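The pseudo-labeling step described in the abstract can be illustrated with a small sketch. The abstract does not give the loss formula, the threshold value, or any implementation detail, so everything below is an assumption: the function name `pseudo_label_loss` is hypothetical, the threshold of 0.95 is an arbitrary illustrative value, and cross-entropy against the pseudo-label is one common way to build such a training signal, not necessarily the authors' exact formulation.

```python
import math


def pseudo_label_loss(weak_probs, strong_probs, threshold=0.95):
    """Confidence-masked cross-entropy over a batch of unlabeled samples.

    weak_probs / strong_probs hold, per sample, the class probabilities the
    model predicts for the weakly and the strongly augmented view of the
    same text. Returns (loss, number_of_samples_kept).
    The threshold default is an illustrative assumption.
    """
    if len(weak_probs) != len(strong_probs):
        raise ValueError("both views must cover the same samples")
    total, kept = 0.0, 0
    for weak, strong in zip(weak_probs, strong_probs):
        confidence = max(weak)
        if confidence < threshold:
            continue  # low-confidence prediction: no training signal
        pseudo_label = weak.index(confidence)  # label taken from the weak view
        # Cross-entropy of the strong view against the pseudo-label.
        total += -math.log(max(strong[pseudo_label], 1e-12))
        kept += 1
    # Average over the whole batch, so filtered samples count as zero loss.
    loss = total / len(weak_probs) if weak_probs else 0.0
    return loss, kept


# Toy binary-sentiment batch: the second sample falls below the threshold.
weak = [[0.98, 0.02], [0.60, 0.40], [0.03, 0.97]]
strong = [[0.90, 0.10], [0.20, 0.80], [0.30, 0.70]]
loss, kept = pseudo_label_loss(weak, strong)
```

In this toy batch only the first and third samples pass the threshold, so the ambiguous second sample never pushes the model toward a possibly wrong label.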
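Authors: Liu Tong (刘彤); Liu Chen (刘琛); Ni Weijian (倪维健) (College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China)
Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》; indexed in CSSCI, CSCD, and the Peking University Core Journals list), 2021, No. 5, pp. 51-58 (8 pages)
Funding: This work is one of the research outcomes of the National Natural Science Foundation of China (Grant Nos. 71704096, 61602278) and the Qingdao Social Science Planning Project (Grant No. QDSKL2001117).
Keywords: Sentiment Analysis; Semi-Supervised Learning; Consistency Regularization; Data Augmentation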