Abstract
To address the particular challenges of Tibetan sentiment analysis, such as the lack of annotated data and limited language resources, this paper proposes C-TF, a method for automatically constructing a Tibetan sentiment dictionary that combines a convolutional neural network (CNN) with term frequency. Tibetan has a wealth of emotionally expressive text. Word frequencies were computed for sentiment words drawn from the eight great Tibetan operas of traditional Tibetan literature and from social media comments, and sentiment seed words were derived by combining these frequencies with a CNN. The model was pre-trained on large-scale unlabeled data and fine-tuned on a small amount of labeled data, yielding a Tibetan sentiment dictionary of 12,503 entries. To evaluate the dictionary, we compared it with other dictionaries on the open-source sentiment analysis dataset TU_SA; the experimental results show that our method achieves a significant performance improvement in sentiment dictionary construction. The proposed construction method offers new ideas and experimental evidence for further research on Tibetan text sentiment classification.
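The term-frequency half of the approach can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rank_seed_candidates`, the candidate set, and the toy tokens are all hypothetical, the corpus is assumed to be already word-segmented, and the CNN scoring step described in the abstract is not shown.

```python
from collections import Counter


def rank_seed_candidates(tokenized_docs, candidate_words, top_k=3):
    """Rank candidate sentiment words by relative frequency in the corpus.

    Hypothetical helper for illustration only; the paper combines term
    frequency with a CNN, which this sketch does not attempt to reproduce.
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    total = sum(counts.values())
    if total == 0:
        return []
    # Keep only candidates that actually occur in the corpus.
    scored = [(w, counts[w] / total) for w in candidate_words if counts[w] > 0]
    # Sort by descending frequency, then alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Toy, already-segmented corpus with placeholder tokens (not real Tibetan data).
docs = [["joy", "the", "joy", "sorrow"], ["the", "joy", "anger"]]
candidates = {"joy", "sorrow", "anger", "fear"}
top = rank_seed_candidates(docs, candidates, top_k=2)
```

In a pipeline like the one the abstract outlines, a frequency ranking of this kind would presumably only shortlist candidates; how the CNN scores are combined with it is not specified here.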
Authors
公确多杰
索南才让
Gongque-Duojie; Suonan-Cairang (College of Computer Science and Technology, Qinghai Normal University, Xining 810016, China; Key Laboratory of Tibetan Information Processing, Ministry of Education, Xining 810008, China; Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province, Xining 810008, China)
Source
《高原科学研究》
CSCD
2024, Issue 3, pp. 117-124 (8 pages)
Plateau Science Research
Funding
National Social Science Fund of China project (23BYY07820).
Keywords
sentiment dictionary construction
low-resource language
CNN
C-TF
Tibetan sentiment analysis