Abstract
Pseudo-label generation is an effective strategy for semi-supervised stance detection. In practice, the quality of generated pseudo-labels varies, yet existing work treats all pseudo-labels as equally reliable and does not fully consider the influence of category imbalance on pseudo-label quality. To address these two issues, a Semi-supervised stance Detection model based on Category-aware curriculum Learning (SDCL) is proposed. First, a pre-trained classification model generates pseudo-labels for unlabeled tweets. Second, the tweets are sorted within each category by pseudo-label quality, and the top k high-quality tweets of each category are selected. Finally, the tweets selected from all categories are merged and re-sorted, and the sorted tweets with their pseudo-labels are fed back into the classification model to further optimize its parameters. Experimental results show that, compared with SANDS (Stance Analysis via Network Distant Supervision), the best-performing baseline, the proposed model improves the macro-averaged F1 (Mac-F1) score on the StanceUS dataset by 2, 1, and 3 percentage points under three splits (500, 1000, and 1500 labeled tweets in total), respectively, and improves the Mac-F1 score on the StanceIN dataset by 1 percentage point under each of the three splits, validating the effectiveness of the proposed model.
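The selection procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how pseudo-label quality is measured, so the sketch assumes it is the highest predicted class probability, and it assumes the merged tweets are ordered from most to least confident. The function name select_class_aware_top_k and the toy probabilities are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# One selected example: (tweet index, pseudo-label, confidence).
Selected = Tuple[int, int, float]


def select_class_aware_top_k(
    probs: Sequence[Sequence[float]], k: int
) -> List[Selected]:
    """Pick the k most confident pseudo-labeled tweets per class, then merge.

    Assumption (not stated in the abstract): "quality" is taken to be the
    highest class probability predicted for a tweet.
    """
    by_class: Dict[int, List[Selected]] = defaultdict(list)
    for idx, row in enumerate(probs):
        # The pseudo-label is the class with the highest predicted probability.
        label = max(range(len(row)), key=lambda c: row[c])
        by_class[label].append((idx, label, row[label]))

    selected: List[Selected] = []
    for items in by_class.values():
        # Rank within each class separately, so a frequent class cannot
        # crowd out a rare one.
        items.sort(key=lambda t: t[2], reverse=True)
        selected.extend(items[:k])

    # Merge all classes and re-sort, most confident first (assumed order).
    selected.sort(key=lambda t: t[2], reverse=True)
    return selected


# Toy class probabilities for five unlabeled tweets and two stance classes.
toy_probs = [
    [0.90, 0.10],
    [0.60, 0.40],
    [0.80, 0.20],
    [0.30, 0.70],
    [0.45, 0.55],
]
curriculum = select_class_aware_top_k(toy_probs, k=2)
print([idx for idx, _, _ in curriculum])
```

Ranking inside each class before merging is what makes the selection category-aware: with a single global top-k, a majority class with many confident predictions could fill every slot and leave minority classes without pseudo-labeled examples.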
Authors
高肇泽
朱小飞
项能强
GAO Zhaoze; ZHU Xiaofei; XIANG Nengqiang (College of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400054, China)
Source
《计算机应用》
CSCD
Peking University Core Journal
2024, Issue 10, pp. 3281-3287 (7 pages)
Journal of Computer Applications
Funding
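Natural Science Foundation of Chongqing (CSTB2022NSCQ-MSX1672)
Major Project of the Science and Technology Research Program of Chongqing Municipal Education Commission (KJZD-M202201102)
Chongqing University of Technology University-level Joint Funding Project (gzlcx20233248)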
Keywords
semi-supervised
stance detection
category imbalance
curriculum learning
pseudo-label generation