Abstract
To address the limited Tibetan corpus resources and the scarcity of pre-trained models available for training, two pre-trained models with strong encoding capabilities, T-Transformer-XL and T-XLNet, are established and trained on a self-built large-scale Tibetan dataset, T-News. Considering the special structure of the Tibetan script, the byte-pair encoding in the SentencePiece tokenization model is used to segment the Tibetan data, and the tokenization strategy and objective function are adjusted to handle Tibetan text generation under different computational budgets and application scenarios. The T-Transformer-XL model is adapted with a recurrence mechanism and relative positional encoding to effectively model the contextual features of long texts, while the T-XLNet model is adapted with permutation language modeling and uses a two-stream self-attention mechanism to extract text features. Finally, a self-supervised manifold-based data augmentation method is employed, in which a masked language model generates realistic augmented samples to enrich the output text of the pre-trained models. Experimental results show that T-Transformer-XL and T-XLNet perform well on text generation tasks, and a specific model can be selected according to the task requirements, the available computational resources, and the performance demands to achieve the best application results.
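The abstract describes segmenting the Tibetan data with byte-pair encoding from the SentencePiece tokenization model. Below is a minimal sketch of that step using the open-source SentencePiece library; the corpus file name, vocabulary size, and character coverage are illustrative assumptions, since the abstract does not specify them.

```python
# A hedged sketch of BPE tokenization for Tibetan with SentencePiece.
# File names and hyperparameters are assumptions, not values from the paper.
import sentencepiece as spm

# Train a byte-pair-encoding model on raw Tibetan text (one sentence per line).
# "t_news.txt" is a hypothetical dump of the self-built T-News corpus.
spm.SentencePieceTrainer.train(
    input="t_news.txt",          # hypothetical raw Tibetan corpus file
    model_prefix="tibetan_bpe",  # writes tibetan_bpe.model / tibetan_bpe.vocab
    model_type="bpe",            # byte-pair encoding, as stated in the abstract
    vocab_size=32000,            # assumed; not specified in the abstract
    character_coverage=1.0,      # assumed: cover the full Tibetan character set
)

# Tokenize Tibetan text with the trained model.
sp = spm.SentencePieceProcessor(model_file="tibetan_bpe.model")
pieces = sp.encode("བོད་ཡིག", out_type=str)  # example Tibetan string ("Tibetan script")
print(pieces)
```

Setting model_type="bpe" selects the byte-pair encoding mentioned in the abstract; swapping in "unigram" would change the subword algorithm without altering the rest of the pipeline.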
Authors
JIA Xingxing; LU Yu; YANG Longfei; DUO La; WANG Daoshun
(School of Mathematics and Statistics, Lanzhou University, Lanzhou 730000, China; The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810000, China; Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province, Xining 810000, China; Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)
Source
《西安邮电大学学报》 (Journal of Xi'an University of Posts and Telecommunications), 2024, No. 4, pp. 93-99 (7 pages)
Funding
National Natural Science Foundation of China (61902176)
Open Project of the State Key Laboratory of Tibetan Intelligent Information Processing and Application / Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province (2023-Z-004)
Keywords
Tibetan
natural language processing
deep neural network
text generation
data augmentation