Very Short Texts Hierarchical Classification Combining Semantic Interpretation and DeBERTa


Abstract: Hierarchical text classification has important applications in scenarios such as topic classification of social media comments and search term classification. Data in these scenarios are often very short texts, characterized by sparse and sensitive information, which poses great challenges for feature representation and classification performance; the complexity and interrelatedness of the hierarchical label space make the task harder still. To address this, a method that fuses semantic interpretation with the DeBERTa model is proposed. Its core idea is twofold: introduce the semantic interpretation of each word or phrase in its specific context to supplement and refine the content information available to the model; and use DeBERTa's disentangled attention mechanism and enhanced mask decoder to better capture positional information and improve feature extraction. The method first performs word segmentation and part-of-speech tagging on the training text, then constructs a GlossDeBERTa model for high-accuracy semantic disambiguation to obtain a semantic interpretation sequence. Next, the SimCSE framework is used to vectorize the interpretation sequence so as to better represent its sentence-level information. Finally, the training text is passed through the DeBERTa network to obtain a feature vector representation of the original text, which is added to the corresponding feature vector of the interpretation sequence and fed to a multi-class classifier. The experiments select the very short text portion of TREC, a short text hierarchical classification dataset, and augment the data, yielding a dataset with an average length of 12 words. Multiple comparison experiments show that the proposed DeBERTa model with fused semantic interpretation performs best, with considerably higher Accuracy, F1-micro, and F1-macro values on the validation and test sets than the other models compared, and that it handles the very short text hierarchical classification task well.
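The final fusion step described in the abstract (adding the text feature vector to the interpretation feature vector, then passing the sum to a multi-class classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the vectors below are random stand-ins for what DeBERTa and SimCSE would produce, and the names `fuse`, `classify`, `dim`, and `num_classes`, as well as the single linear layer with softmax, are assumptions made only for illustration.

```python
import math
import random


def fuse(text_vec, gloss_vec):
    """Element-wise sum of two feature vectors of equal dimension."""
    if len(text_vec) != len(gloss_vec):
        raise ValueError("feature vectors must have the same dimension")
    return [t + g for t, g in zip(text_vec, gloss_vec)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, bias):
    """Linear layer followed by softmax; returns (predicted index, probabilities)."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted, probs


# Hypothetical stand-ins: in the described method, text_vec would come from
# DeBERTa and gloss_vec from SimCSE applied to the interpretation sequence.
rng = random.Random(0)
dim, num_classes = 8, 3  # illustrative sizes, not taken from the paper
text_vec = [rng.uniform(-1, 1) for _ in range(dim)]
gloss_vec = [rng.uniform(-1, 1) for _ in range(dim)]

# Untrained, randomly initialised classifier parameters (assumption).
weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_classes)]
bias = [0.0] * num_classes

fused = fuse(text_vec, gloss_vec)
label, probs = classify(fused, weights, bias)
print("predicted class index:", label)
print("class probabilities:", [round(p, 3) for p in probs])
```

Summation requires both encoders to emit vectors of the same dimension; if they differ, a projection layer or concatenation would be needed instead, and the abstract does not say how the authors handle that case.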

Authors: CHEN Haoyang; ZHANG Lei (State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China)

Affiliation: State Key Laboratory for Novel Software Technology, Nanjing University

Source: Computer Science (计算机科学), indexed in CSCD and the Peking University Core Journals list, 2024, No. 5, pp. 250-257 (8 pages)

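Funding: National Natural Science Foundation of China (62192783, 62376117); Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing University.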

Keywords: Very short text; Hierarchical classification; Semantic interpretation; DeBERTa; GlossDeBERTa; SimCSE

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]



