Research on Tibetan Short Text Classification Based on GraphSAGE Network
Abstract Text classification is an important research direction in natural language processing, but progress on Tibetan text classification has been slow because of data scarcity, the complexity of extracting linguistic features, and the diversity of discourse structures. This paper therefore adapts a graph neural network as the base model. First, on top of "syllable-syllable" and "syllable-document" modeling, document features are incorporated and a binary classification model is used to dynamically construct "document-document" edges, so that the global features of short texts are fully exploited; a sliding window is added to reduce the model's computational complexity, and the optimal window size is searched for. Second, to address the syllable sparsity of Tibetan short texts, GraphSAGE is introduced as the base model for the first time, and the performance of different aggregation functions on Tibetan short text classification is compared. Finally, to capture the heterogeneity of relationships between nodes, neighbor-node features are weighted before average pooling to strengthen the model's feature extraction. On the TNCC title dataset the model reaches a classification accuracy of 62.50%, which is 2.56%, 1%, and 2.4% higher than the traditional GCN, the original GraphSAGE, and the pre-trained language model CINO, respectively.
Authors JING Rong; YANG Yimin; WAN Fucheng; GUO Qi; YU Hongzhi; MA Ning (Key Laboratory of Linguistic and Cultural Computing, Ministry of Education, Northwest Minzu University, Lanzhou, Gansu 730030, China; Key Laboratory of China's Ethnic Languages and Intelligent Processing of Gansu Province, Northwest Minzu University, Lanzhou, Gansu 730030, China; Dalian Meteorological Bureau, Dalian Meteorological Information Center, Dalian, Liaoning 116000, China)
Source Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University Core Journal, 2024, No. 9, pp. 58-65 (8 pages)
Funding National Natural Science Foundation of China (62366046).
Keywords graph neural network; Tibetan text classification; TNCC dataset
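
The abstract names two mechanisms that are easier to picture in code: sliding-window construction of "syllable-syllable" edges, and weighting neighbor features before average pooling in a GraphSAGE-style aggregation. The sketch below is a minimal illustration under stated assumptions and is not the authors' implementation. The function names, the raw co-occurrence counts used as edge evidence, and the per-edge `weights` dictionary are all hypothetical; the abstract does not specify how edge weights or feature weights are computed. A full GraphSAGE layer would also apply a learned linear map and a nonlinearity, which are omitted here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(docs, window=3):
    """Count syllable pairs that co-occur inside a sliding window.

    docs: list of documents, each a list of syllable strings.
    Returns a Counter mapping a sorted (syllable, syllable) pair to the
    number of windows that contain both syllables.
    """
    counts = Counter()
    for syllables in docs:
        # A document shorter than the window contributes a single window.
        n_windows = max(1, len(syllables) - window + 1)
        for start in range(n_windows):
            seen = sorted(set(syllables[start:start + window]))
            for a, b in combinations(seen, 2):
                counts[(a, b)] += 1
    return counts


def weighted_mean_aggregate(features, neighbors, weights):
    """One GraphSAGE-style aggregation step with per-edge weights.

    features:  dict node -> list of floats
    neighbors: dict node -> list of neighbor nodes
    weights:   dict (node, neighbor) -> non-negative float
               (hypothetical weighting; missing pairs default to 1.0)
    Returns dict node -> own features concatenated with the weighted
    mean of the neighbor features (zeros if there are no neighbors).
    """
    out = {}
    for node, feat in features.items():
        agg = [0.0] * len(feat)
        total = 0.0
        for nb in neighbors.get(node, []):
            w = weights.get((node, nb), 1.0)
            total += w
            for i, x in enumerate(features[nb]):
                agg[i] += w * x
        if total > 0:
            agg = [x / total for x in agg]
        out[node] = list(feat) + agg
    return out


if __name__ == "__main__":
    # Placeholder tokens stand in for Tibetan syllables.
    edges = cooccurrence_edges([["a", "b", "c", "d"]], window=3)
    print(edges[("b", "c")])

    feats = {"d1": [1.0, 0.0], "s1": [0.0, 2.0], "s2": [4.0, 0.0]}
    nbrs = {"d1": ["s1", "s2"]}
    wts = {("d1", "s1"): 1.0, ("d1", "s2"): 3.0}
    print(weighted_mean_aggregate(feats, nbrs, wts)["d1"])
```

A smaller `window` yields fewer candidate syllable pairs per document, which is one plausible reading of how the sliding window reduces computation; setting every weight to 1.0 recovers plain mean aggregation.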