Abstract
As Internet penetration among Indonesian and Malay speakers rises, there is a strong demand for information processing of massive Indonesian and Malay texts. Although researchers have studied both languages fairly extensively, as low-resource languages they receive far less attention than widely used languages, and state-of-the-art deep learning methods have not been well exploited. This paper reviews and summarizes natural language processing techniques for Indonesian and Malay, including morphological analysis, syntactic parsing, machine translation, and spelling checking. A comparative analysis of the related work shows that, in most studies, differences in corpus size and evaluation criteria make it difficult to compare algorithms objectively. Finally, in light of the currently open language resources for Indonesian and Malay in various fields, the problems facing natural language processing research on these languages are identified and future trends are discussed.
Authors
蒋盛益
李珊珊
符斯慧
林楠铠
JIANG Shengyi; LI Shanshan; FU Sihui; LIN Nankai (School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006; Guangzhou Key Laboratory of Multilingual Intelligent Processing, Guangdong University of Foreign Studies, Guangzhou 510006)
Source
《模式识别与人工智能》
EI
CSCD
Peking University Core Journal
2020, Issue 6, pp. 530-541 (12 pages)
Pattern Recognition and Artificial Intelligence
Funding
Supported by the National Natural Science Foundation of China (No. 61572145)
and the Guangzhou Science and Technology Program (No. 202002030227).
Keywords
Indonesian
Malay
Agglutinative Language
Low-Resource Language
Natural Language Processing