Review of the First International Ancient Chinese Word Segmentation and POS Tagging Bakeoff (第一届古代汉语分词和词性标注国际评测)

Abstract: The volume of ancient Chinese texts is enormous, and intelligent methods for processing them automatically are urgently needed. Automatic word segmentation and part-of-speech (POS) tagging are the basic tasks of ancient Chinese information processing, but the lack of large-scale lexicons and annotated corpora has slowed the development of automatic analysis techniques for ancient Chinese. This paper summarizes the First International Ancient Chinese Word Segmentation and POS Tagging Bakeoff, which provided a manually proofread, finely annotated corpus as unified training data, used the F1 score as the evaluation metric, and compared ancient Chinese lexical analysis systems on two test sets (a basic test set and a blind test set). The bakeoff also distinguished open and closed test modes according to whether external resources were used. It was held at the Second Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA), in the context of the 13th Edition of the Language Resources and Evaluation Conference (LREC), and 14 teams participated. On the basic test set, the F1 scores for word segmentation and POS tagging reached 96.16% and 92.05%, respectively, in the closed test, and 96.34% and 92.56% in the open test. On the blind test set, they reached 93.64% and 87.77% in the closed test, and 95.03% and 89.47% in the open test. Out-of-vocabulary words remain the bottleneck of ancient Chinese lexical analysis. The best systems in the bakeoff raised ancient Chinese lexical analysis to a new level, and deep learning and pre-trained models substantially improved the performance of automatic ancient Chinese processing.
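The abstract names the F1 score as the evaluation metric but does not describe the scoring procedure. The following is a minimal sketch of how F1 is commonly computed for word segmentation and POS tagging, assuming span-level matching: a predicted word counts as correct only if its character boundaries match the gold standard, and, for POS tagging, its tag matches as well. The function names, the example sentence, and the tag labels are illustrative assumptions, not the bakeoff's official scorer or tag set.

```python
from typing import List, Optional, Set, Tuple


def to_items(words: List[str], tags: Optional[List[str]] = None) -> Set[Tuple]:
    """Turn a word sequence into a set of (start, end) character spans.

    If tags are given, each item becomes (start, end, tag), so a word is
    only matched when both its boundaries and its tag agree.
    """
    items, start = set(), 0
    for i, word in enumerate(words):
        end = start + len(word)
        items.add((start, end) if tags is None else (start, end, tags[i]))
        start = end
    return items


def precision_recall_f1(gold: Set[Tuple], pred: Set[Tuple]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from gold and predicted item sets."""
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: gold segmentation vs. a system that merges two words.
gold_words = ["天下", "大", "乱"]
pred_words = ["天下", "大乱"]
seg_p, seg_r, seg_f1 = precision_recall_f1(to_items(gold_words), to_items(pred_words))

# Hypothetical example: same segmentation, one wrong tag (tag labels are made up).
gold_tags = ["n", "a", "v"]
pred_tags = ["n", "d", "v"]
pos_p, pos_r, pos_f1 = precision_recall_f1(
    to_items(gold_words, gold_tags), to_items(gold_words, pred_tags)
)

print(f"segmentation F1 = {seg_f1:.4f}")
print(f"POS tagging F1  = {pos_f1:.4f}")
```

Under this scheme a segmentation error also costs the POS score, because a word with wrong boundaries cannot match any gold item. That is consistent with the POS tagging F1 values in the abstract being lower than the segmentation values.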

Authors: LI Bin (李斌); YUAN Yiguo (袁义国); LU Jingya (芦靖雅); FENG Minxuan (冯敏萱); XU Chao (许超); QU Weiguang (曲维光); WANG Dongbo (王东波) (School of Chinese Language and Literature, Nanjing Normal University, Nanjing, Jiangsu 210097, China; School of Computer and Electronic Information, Nanjing Normal University, Nanjing, Jiangsu 210023, China; College of Information Management, Nanjing Agricultural University, Nanjing, Jiangsu 210095, China)

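Affiliations: School of Chinese Language and Literature, Nanjing Normal University; School of Computer and Electronic Information, Nanjing Normal University; College of Information Management, Nanjing Agricultural University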

Source: Journal of Chinese Information Processing (《中文信息学报》); indexed in CSCD and the Peking University Core Journals list; 2023, No. 3, pp. 46-53, 64 (9 pages)

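Funding: National Social Science Fund of China (21ZD&331); Jiangsu Provincial Social Science Fund (20JYB004); State Language Commission Project (YB145-41); Key Project on Ancient Books Work (22GJK006)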

Keywords: ancient Chinese; evaluation; word segmentation; POS tagging; ancient language information processing

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]