期刊文献+

融合语义解释和DeBERTa的极短文本层次分类

Very Short Texts Hierarchical Classification Combining Semantic Interpretation and DeBERTa
下载PDF
导出
摘要 文本层次分类在社交评论主题分类、搜索词分类等场景中有重要应用,这些场景的数据往往具有极短文本特征,体现在信息的稀疏性、敏感性等中,这对模型特征表示和分类性能带来了很大挑战,而层次标签空间的复杂性和关联性使得难度进一步加剧。基于此,提出了一种融合语义解释和DeBERTa模型的方法,该方法的核心思想在于:引入具体语境下各个字词或词组的语义解释,补充优化模型获取的内容信息;结合DeBERTa模型的注意力解耦机制与增强掩码解码器,以更好地把握位置信息、提高特征提取能力。所提方法首先对训练文本进行语法分词、词性标注,再构造GlossDeBERTa模型进行高准确率的语义消歧,获得语义解释序列;然后利用SimCSE框架使解释序列向量化,以更好地表征解释序列中的句子信息;最后训练文本经过DeBERTa模型神经网络后,得到原始文本的特征向量表示,再与解释序列中的对应特征向量相加,传入多分类器。实验遴选短文本层次分类数据集TREC中的极短文本部分,并进行数据扩充,最终得到的数据集平均长度为12词。多组对比实验表明,所提出的融合语义解释的DeBERTa模型性能最为优秀,在验证集和测试集上的Accuracy值、F1-micro值、F1-macro值相比其他算法模型有较大的提升,能够很好地应对极短文本层次分类任务。 Text hierarchy classification has important applications in scenarios such as social comment topic classification and search term classification.The data in these scenarios often exhibits short text features,which is reflected in the sparsity and sensitivity of information.It poses great challenges for model feature representation and classification performance.The complexity and associativity of the hierarchical label space further exacerbate the difficulties.In view of this,a method fusing semantic interpretation and DeBERTa model is proposed,and the core idea of the method is as follows:introducing the semantic interpretation of individual words or phrases in specific contexts to supplement and optimize the content information acquired by the model;combining the disentangled attention and enhanced mask decoder of the DeBERTa model to better grasp the location information and improve the feature extraction ability.The method firstly performs grammatical disambiguation and lexical annotation on the training text,and then constructs the GlossDeBERTa model to perform semantic disambiguation with high accuracy to obtain the semantic interpreted sequence.Then the SimCSE framework is used to make the interpreted sequence vectorized to better characterize the sentence information in the interpreted sequence.Finally,the training text passes through the DeBERTa model neural network to get the feature vector representations of the original text,which is then summed up with the corresponding feature vector in the interpreted sequence,and passed into the multi-class classifier.The experiments select the very short text portion of the short text hierarchical categorization dataset TREC and expand the data,resulting in a dataset with an average length of 12 words.Multiple sets of comparison experiments show that the DeBERTa model proposed in this paper with fused semantic interpretation has the best performance,and the Accuracy,F1-micro,and F1-macro values on the validation and test sets are much better than other algorithmic models,which can well cope with the task of hierarchical categorization of very short texts.
作者 陈昊飏 张雷 CHEN Haoyang;ZHANG Lei(State Key Laboratory for Novel Software Technology,Nanjing University,Nanjing 210023,China)
出处 《计算机科学》 CSCD 北大核心 2024年第5期250-257,共8页 Computer Science
基金 国家自然科学基金(62192783,62376117) 南京大学软件新技术与产业化协同创新中心。
关键词 极短文本 层次分类 语义解释 DeBERTa GlossDeBERTa SimCSE Very short text Hierarchical classification Semantic interpretation DeBERTa GlossDeBERTa SimCSE
  • 相关文献

参考文献4

二级参考文献24

共引文献65

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部