
Pretrained Models and Evaluation Data for the Khmer Language

Abstract: Trained on a large corpus, pretrained models (PTMs) can capture different levels of concepts in context and hence generate universal language representations, which greatly benefit downstream natural language processing (NLP) tasks. In recent years, PTMs have been widely used in most NLP applications, especially for high-resource languages such as English and Chinese. However, scarce resources have hindered the progress of PTMs for low-resource languages. This work presents Transformer-based PTMs for the Khmer language for the first time. We evaluate our models on two downstream tasks: part-of-speech tagging and news categorization. The dataset for the latter task is self-constructed. Experiments demonstrate the effectiveness of the Khmer models. In addition, we find that current Khmer word segmentation technology does not aid performance improvement. We aim to release our models and datasets to the community in hopes of facilitating the future development of Khmer NLP applications.
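As a toy illustration of the part-of-speech tagging evaluation mentioned in the abstract, the standard metric is per-token accuracy over aligned gold and predicted tag sequences. The tag names and example sequences below are illustrative assumptions, not data from the paper:

```python
def pos_accuracy(gold, pred):
    """Per-token accuracy: fraction of positions where the predicted
    tag matches the gold tag. Sequences must be aligned (same length)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)

# Hypothetical 5-token sentence with one tagging error.
gold = ["NOUN", "VERB", "NOUN", "ADP", "NOUN"]
pred = ["NOUN", "VERB", "ADJ", "ADP", "NOUN"]
print(pos_accuracy(gold, pred))  # 0.8
```

The same accuracy computation applies to the news categorization task, with one label per document instead of one tag per token.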
Source: Tsinghua Science and Technology (indexed in SCIE, EI, CAS, CSCD), 2022, No. 4, pp. 709-718 (10 pages).
Funding: Supported by the Major Projects of Guangdong Education Department for Foundation Research and Applied Research (No. 2017KZDXM031) and the Guangzhou Science and Technology Plan Project (No. 202009010021).