
MLWS2021 Tibetan Word Segmentation Evaluation Report

Study on the Evaluation of Tibetan Words Segmentation of MLWS2021
Abstract Tibetan word segmentation is a key foundational task in Tibetan information processing and a prerequisite for intelligent information processing such as machine translation, intelligent retrieval, and natural language understanding. Tibetan is one of the evaluation languages of the Minority Language Word Segmentation Technology Evaluation (MLWS2021). Building on MLWS2017, the corpus has been expanded from a single news corpus to a comprehensive multi-domain corpus covering news, law, economics, fiction, and language and writing, and both the quality and the quantity of the training and test corpora have improved considerably. This paper first introduces the composition, collection, and collation of the Tibetan word segmentation evaluation corpus of MLWS2021. Then, based on an analysis of the design ideas behind Tibetan word segmentation evaluation and analysis software, and in view of the diversity of the test corpus, the "text comparison" and "Tibetan evaluation and analysis" software tools are designed, test corpora for the evaluation software are built as needed, and the correctness of the software is verified by testing. Finally, without damaging the evaluation corpus, the corpus is preprocessed and tested, the Tibetan word segmentation evaluation results of the participating teams' different models are reported, and the correctness of those results is verified.
Authors GAO Dingguo (高定国); YANG Xiaolong (杨晓龙); YANG Yufan (杨宇帆); Quci (取次); GAO Hongmei (高红梅) (School of Information Science and Technology, Tibet University, Lhasa 850000, China)
Source Plateau Science Research (《高原科学研究》), CSCD, 2022, No. 1, pp. 82-89 (8 pages)
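Funding National Natural Science Foundation of China project (6266038); Key Research Project of the State Language Commission (ZDI135-118); 2021 Autonomous Region First-Class Course Construction Project.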
Keywords MLWS2021; Tibetan word segmentation; evaluation
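
The abstract does not say which metrics the "Tibetan evaluation and analysis" software computes. As a rough illustration only, the sketch below assumes the span-based precision, recall, and F1 commonly used to score word segmentation. The function names `to_spans` and `evaluate` are hypothetical, and the check that both segmentations cover the same underlying text is likewise an assumption, loosely inspired by the "text comparison" step.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def evaluate(gold_words, pred_words):
    """Return (precision, recall, f1) for one segmented sentence.

    Assumption: a predicted word counts as correct only if both of its
    boundaries match a word in the gold segmentation.
    """
    # Assumed sanity check: segmentation must not alter the underlying text.
    if "".join(gold_words) != "".join(pred_words):
        raise ValueError("gold and predicted segmentations cover different text")

    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Placeholder Latin strings stand in for Tibetan syllables.
    print(evaluate(["ab", "c", "de"], ["ab", "cd", "e"]))
```

In this toy case only the first word has matching boundaries on both sides, so one of three predicted words and one of three gold words are counted as correct.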