
T-Transformer-XL and T-XLNet: Two Tibetan pretraining models
Abstract: To address the limited Tibetan corpus resources and the scarcity of pre-trained models available for training, two pre-trained models with strong encoding capabilities are established: T-Transformer-XL and T-XLNet. Both models are trained on T-News, a self-built large-scale Tibetan dataset. Considering the unique structure of the Tibetan script, byte-pair encoding in the SentencePiece tokenization model is used to tokenize the Tibetan data, and the tokenization strategy and objective function are adjusted to handle Tibetan text generation under different computational budgets and application scenarios. The T-Transformer-XL model is adapted with a recurrence mechanism and relative positional encoding to effectively model the contextual features of long texts, while the T-XLNet model is adapted with permutation language modeling, using a two-stream self-attention mechanism to extract text features. Finally, a self-supervised manifold-based data augmentation method is employed, in which a masked language model generates realistic augmented samples to enrich the output text of the pre-trained models. Experimental results show that T-Transformer-XL and T-XLNet perform well on text generation tasks; a specific model can be selected according to the task requirements, the available computational resources, and the required model performance to achieve the best application results.
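As a rough illustration of the tokenization step described in the abstract, the sketch below trains a byte-pair-encoding SentencePiece model on a Tibetan plain-text corpus and tokenizes a sample sentence. This is not the authors' code: the file names, vocabulary size, coverage setting, and the example sentence are illustrative assumptions, using only the public `sentencepiece` Python API.

```python
# Minimal sketch: BPE tokenization of Tibetan text with SentencePiece.
# File names, vocab_size, and the sample sentence are illustrative only.
import sentencepiece as spm

# Train a BPE model on a plain-text Tibetan corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="t_news_corpus.txt",     # hypothetical path to the raw Tibetan corpus
    model_prefix="tibetan_bpe",    # writes tibetan_bpe.model / tibetan_bpe.vocab
    vocab_size=32000,              # assumed vocabulary size
    model_type="bpe",              # byte-pair encoding, as described in the abstract
    character_coverage=0.9995,     # retain rare Tibetan syllable components
)

# Load the trained model and tokenize a Tibetan sentence into subword pieces and ids.
sp = spm.SentencePieceProcessor(model_file="tibetan_bpe.model")
sample = "བོད་ཡིག་ནི་བོད་ཀྱི་སྐད་ཡིག་ཡིན།"
print(sp.encode(sample, out_type=str))   # subword pieces
print(sp.encode(sample, out_type=int))   # token ids
```

The resulting subword vocabulary can then be shared by both pretraining models, with the tokenization granularity adjusted to the available compute as the abstract describes.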
Authors: JIA Xingxing, LU Yu, YANG Longfei, DUO La, WANG Daoshun (School of Mathematics and Statistics, Lanzhou University, Lanzhou 730000, China; The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810000, China; Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province, Xining 810000, China; Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)
Source: Journal of Xi'an University of Posts and Telecommunications (《西安邮电大学学报》), 2024, No. 4, pp. 93-99 (7 pages)
Funding: National Natural Science Foundation of China (61902176); Open Project of the State Key Laboratory of Tibetan Intelligent Information Processing and Application / Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province (2023-Z-004).
Keywords: Tibetan; natural language processing; deep neural network; text generation; data augmentation