Challenges and Thoughts in Making Text Ground Truth for Republican Chinese Newspaper: Taking Jing Bao as an Example
Abstract: Many researchers, particularly in Europe and North America, have explored the use of machine learning for optical character recognition (OCR), and many projects are producing ground truth (GT) data for this purpose. The situation is different for non-Latin script (NLS) material. In 2021, the Early Chinese Periodicals Online (ECPO) project at the University of Heidelberg, Germany, began working on ways to produce machine-readable full text from historical Chinese newspapers. ECPO uses several machine-learning approaches, including convolutional neural networks, to develop a semi-automatic pipeline for producing machine-readable full text. We chose the Republican-era entertainment tabloid Jing Bao (The Crystal, 1919-1940) as the basis for our experiments. The paper focuses on two aspects. First, we describe our ground truth editing workflow in detail: assembling the editing team, organizing the workflows, establishing processing guidelines, and ensuring quality control. Second, we discuss particular challenges in producing the GT sets, including character encoding issues and problems with variant characters related to Unicode. The project produced two sets of ground truth data: textual/structural data (full-text GT) and layout segmentation data (geometry GT). Based on this work, we point out the pitfalls we encountered and how we addressed them, in the hope of making machine learning more efficient and of helping others who work with NLS material.
Authors: Xie Jia (谢佳); Yip SukMan (叶淑敏)
Source: Digital Humanities Research (《数字人文研究》), 2023, No. 4, pp. 49-62 (14 pages)
Keywords: ground truth; Republican Chinese newspapers; Jing Bao; OCR
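
The abstract mentions problems with variant characters related to Unicode but does not say which cases the project ran into. As a generic illustration, not taken from the paper, the sketch below shows one well-known pitfall: a CJK compatibility ideograph (U+F900) and its unified counterpart (U+8C48) look the same but are different code points, so a naive comparison between a transcription and a reference text reports a mismatch until both are normalized. The helper names `same_after_nfc` and `describe` are hypothetical and chosen only for this example.

```python
import unicodedata
from typing import List

# Illustrative pair only (an assumption, not a case reported in the paper):
# U+F900 is a CJK compatibility ideograph whose canonical decomposition
# is the unified ideograph U+8C48.
COMPAT = "\uF900"
UNIFIED = "\u8C48"


def same_after_nfc(a: str, b: str) -> bool:
    """Return True if two strings are equal once both are NFC-normalized."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def describe(text: str) -> List[str]:
    """List each character's code point, useful for spotting look-alikes."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Raw comparison treats the two look-alike characters as different.
raw_equal = COMPAT == UNIFIED
# After normalization the compatibility form maps to the unified form.
normalized_equal = same_after_nfc(COMPAT, UNIFIED)
```

Normalization is lossy here: once applied, the original variant form can no longer be recovered, so whether to normalize ground truth text or to preserve the variants as printed is an editorial decision that a project's guidelines would need to settle.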