Abstract
[Objective] This paper examines the performance of pre-trained language models in a resource-scarce language, in support of building Tibetan knowledge graphs and semantic retrieval. [Methods] We collected Chinese-Tibetan bilingual texts on traditional Tibetan festivals from news websites such as People's Daily and its Tibetan edition, and compared several pre-trained language models with word embeddings on named entity recognition (NER) in this Chinese-Tibetan bilingual setting. We also analyzed how the two feature-processing layers of the NER model (the BiLSTM layer and the CRF layer) affect the results. [Results] Compared with word embeddings, the Chinese and Tibetan pre-trained language models improved F1 on this task by 0.0108 and 0.0590, respectively. In particular, when entities are scarce, the pre-trained models extract more textual information than word embeddings and shorten training time by 40%. [Limitations] The Tibetan and Chinese data are not parallel corpora, and the Tibetan data contain fewer entities than the Chinese data. [Conclusions] Pre-trained language models are not only effective on Chinese text but also perform well in Tibetan, a resource-scarce language.
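The abstract describes the model architecture only at a high level. The following minimal sketch (not the authors' code) illustrates how a pre-trained language model can be combined with the BiLSTM and CRF feature-processing layers mentioned in the Methods; the encoder name bert-base-chinese, the tag count, and all hyper-parameters are illustrative assumptions, and a Tibetan pre-trained model would be substituted for Tibetan text.

```python
# Illustrative sketch of a pre-trained encoder + BiLSTM + CRF model for NER.
# Assumes: pip install torch transformers pytorch-crf
import torch
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF


class PretrainedBiLSTMCRF(nn.Module):
    def __init__(self, encoder_name="bert-base-chinese", num_tags=9, lstm_hidden=256):
        super().__init__()
        # Pre-trained language model (swap in a Tibetan PLM for Tibetan text).
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # BiLSTM feature-processing layer over the contextual token embeddings.
        self.bilstm = nn.LSTM(hidden, lstm_hidden, batch_first=True, bidirectional=True)
        # Linear projection to per-token tag scores (CRF emissions).
        self.emission = nn.Linear(2 * lstm_hidden, num_tags)
        # CRF layer models label-transition constraints (e.g., B-LOC followed by I-LOC).
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        features, _ = self.bilstm(hidden)
        emissions = self.emission(features)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence under the CRF.
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        # Inference: Viterbi-decoded tag sequence for each sentence.
        return self.crf.decode(emissions, mask=mask)
```

Replacing the encoder with static word embeddings (an nn.Embedding layer) while keeping the BiLSTM and CRF layers yields the word-embedding baseline that the paper compares against.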
Authors
Deng Yuyang; Wu Dan (School of Information Management, Wuhan University, Wuhan 430072, China; Center for Studies of Human-Computer Interaction and User Behavior, Wuhan University, Wuhan 430072, China)
Source
《数据分析与知识发现》
CSSCI
CSCD
Peking University Core Journals (北大核心)
2023, Issue 7, pp. 125-135 (11 pages)
Data Analysis and Knowledge Discovery
Funding
A research output of a Major Project of the National Social Science Fund of China (Grant No. 19ZDA341).
Keywords
Named Entity Recognition
Tibetan Traditional Culture
Pretrained Language Model