Abstract
Multi-domain text classification suffers from domain and vocabulary differences, which lower classification accuracy and generalization, and traditional methods do not perform well on it. To address this, this paper proposes a multi-domain text classification method based on a variational information bottleneck multi-task algorithm. The task is formulated as a hierarchical representation learning problem in which task-specific features are extracted from comprehensive features. First, following the information bottleneck principle, the redundant information between the comprehensive features and the task-specific features is modeled as additive noise with zero mean and a diagonal covariance matrix, and a reparameterization method lets the noise take part in model training. Second, the model loss function is built from the variational bound of the information bottleneck to restrict information flow in the model, so that the comprehensive features with additive noise are decoupled into task-specific features. Finally, the classifier in the decoder processes the task-specific features to produce the text classification result. Experiments show that the proposed model achieves an average classification accuracy of 92.17% on the FDU-MTL multi-domain text classification dataset, outperforming several compared models and demonstrating better interpretability.
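The three steps in the abstract (additive zero-mean noise with diagonal covariance, the reparameterization trick, and a loss built from the variational bound) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the standard normal prior N(0, I), the linear classifier `weights`, and the trade-off weight `beta` are all assumptions chosen for illustration, following the commonly used variational information bottleneck objective of cross-entropy plus a weighted KL term.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps, eps ~ N(0, I).

    The noise is additive, zero-mean, and has a diagonal covariance,
    so each dimension is perturbed independently.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def cross_entropy(logits, label):
    """Negative log-likelihood of `label` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[label]


def vib_loss(mu, log_var, weights, label, beta, rng):
    """Assumed VIB-style loss: classification term + beta * compression term.

    mu, log_var : encoder outputs for one example (hypothetical names)
    weights     : rows of a hypothetical linear classifier, one row per class
    beta        : hypothetical weight on the KL (information-limiting) term
    """
    z = reparameterize(mu, log_var, rng)
    logits = [sum(w * zi for w, zi in zip(row, z)) for row in weights]
    return cross_entropy(logits, label) + beta * kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    loss = vib_loss(mu=[1.0, -1.0], log_var=[0.0, 0.0],
                    weights=[[1.0, 0.0], [0.0, 1.0]],
                    label=0, beta=0.1, rng=rng)
    print(loss)
```

In this sketch a larger `beta` pushes the noisy features toward the prior and so passes less information to the classifier, while `beta = 0` reduces the loss to plain cross-entropy on the sampled features.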
Authors
马儀
邵玉斌
杜庆治
龙华
马迪南
MA Yi; SHAO Yu-Bin; DU Qing-Zhi; LONG Hua; MA Di-Nan (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Yunnan Provincial Key Laboratory of Media Integration, Kunming 650032, China)
Source
《四川大学学报(自然科学版)》
CAS
CSCD
PKU Core Journal (北大核心)
2024, Issue 3, pp. 125-135 (11 pages)
Journal of Sichuan University(Natural Science Edition)
Funding
Supported by the Yunnan Provincial Key Laboratory of Media Integration project (320225403).
Keywords
Information bottleneck (信息瓶颈)
Multi-task model (多任务模型)
Multi-domain (多领域)
Variational boundary (变分边界)
Interpretability (可解释性)