
Pretrained Models and Evaluation Data for the Khmer Language

Abstract: Trained on a large corpus, pretrained models (PTMs) can capture different levels of concepts in context and hence generate universal language representations, which greatly benefit downstream natural language processing (NLP) tasks. In recent years, PTMs have been widely used in most NLP applications, especially for high-resource languages such as English and Chinese. However, scarce resources have hindered the progress of PTMs for low-resource languages. This work presents Transformer-based PTMs for the Khmer language for the first time. We evaluate our models on two downstream tasks: part-of-speech tagging and news categorization. The dataset for the latter task is self-constructed. Experiments demonstrate the effectiveness of the Khmer models. In addition, we find that current Khmer word segmentation technology does not aid performance improvement. We aim to release our models and datasets to the community in hopes of facilitating the future development of Khmer NLP applications.
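As a toy illustration of the part-of-speech tagging evaluation mentioned in the abstract, the standard metric is per-token accuracy over aligned gold and predicted tag sequences. The tag names and example sequences below are illustrative assumptions, not data from the paper:

```python
def pos_accuracy(gold, pred):
    """Per-token accuracy: fraction of positions where the predicted
    tag matches the gold tag. Sequences must be aligned (same length)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)

# Hypothetical 5-token sentence with one tagging error.
gold = ["NOUN", "VERB", "NOUN", "ADP", "NOUN"]
pred = ["NOUN", "VERB", "ADJ", "ADP", "NOUN"]
print(pos_accuracy(gold, pred))  # 0.8
```

The same accuracy computation applies to the news categorization task, with one label per document instead of one tag per token.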
Source: Tsinghua Science and Technology (indexed in SCIE, EI, CAS, CSCD), 2022, No. 4, pp. 709-718 (10 pages).
Funding: Supported by the Major Projects of Guangdong Education Department for Foundation Research and Applied Research (No. 2017KZDXM031) and the Guangzhou Science and Technology Plan Project (No. 202009010021).