Abstract
In recent years, growing attention to the preservation and transmission of history and traditional culture has driven increasing research interest in the digitization of historical documents. Layout analysis is an essential basic step in that digitization, and this paper proposes a layout analysis method for historical Tibetan documents based on a convolutional denoising autoencoder. First, superpixel clustering is applied to the document images to obtain superpixel blocks. Then, a convolutional denoising autoencoder extracts features from these blocks. Finally, an SVM classifier predicts the class of each superpixel block, thereby extracting the individual parts of the document layout. Experiments on a dataset of historical Tibetan documents show that the method can effectively separate the different layout elements.
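The three stages named in the abstract (superpixel clustering, denoising-autoencoder features, SVM classification) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the page is synthetic, k-means on (row, column, intensity) stands in for the unspecified superpixel algorithm, and a small fully connected denoising autoencoder (scikit-learn's `MLPRegressor`) is swapped in for the convolutional one to keep the example short. `PATCH`, `make_toy_page`, and the other helper names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVC

rng = np.random.default_rng(0)
PATCH = 8  # hypothetical patch side length around each superpixel centre


def make_toy_page(size=64):
    """Synthetic stand-in for a scanned page: bright background, dark 'text' band."""
    page = 0.9 + 0.05 * rng.standard_normal((size, size))
    labels = np.zeros((size, size), dtype=int)          # 0 = background
    page[20:40, 8:56] = 0.2 + 0.05 * rng.standard_normal((20, 48))
    labels[20:40, 8:56] = 1                             # 1 = text
    return np.clip(page, 0.0, 1.0), labels


def superpixels(page, n_segments=64):
    """Stage 1: cluster pixels on (row, col, intensity), a SLIC-like stand-in."""
    h, w = page.shape
    rows, cols = np.mgrid[0:h, 0:w]
    feats = np.column_stack([rows.ravel() / h, cols.ravel() / w, page.ravel()])
    km = KMeans(n_clusters=n_segments, n_init=1, random_state=0).fit(feats)
    return km.labels_.reshape(h, w)


def centre_patches(page, segments):
    """Cut a PATCH x PATCH window around each superpixel centroid."""
    half = PATCH // 2
    padded = np.pad(page, half, mode="edge")
    patches, centres = [], []
    for seg_id in np.unique(segments):
        r, c = np.argwhere(segments == seg_id).mean(axis=0).astype(int)
        patches.append(padded[r:r + PATCH, c:c + PATCH].ravel())
        centres.append((r, c))
    return np.array(patches), centres


def train_denoising_autoencoder(patches, hidden=16):
    """Stage 2: learn to reconstruct clean patches from noise-corrupted copies."""
    noisy = np.clip(patches + 0.1 * rng.standard_normal(patches.shape), 0.0, 1.0)
    dae = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=500, random_state=0)
    dae.fit(noisy, patches)
    return dae


def encode(dae, patches):
    """Hidden-layer activations (ReLU is MLPRegressor's default) used as features."""
    return np.maximum(0.0, patches @ dae.coefs_[0] + dae.intercepts_[0])


page, truth = make_toy_page()
segments = superpixels(page)
patches, centres = centre_patches(page, segments)
dae = train_denoising_autoencoder(patches)
features = encode(dae, patches)

# Stage 3: an SVM labels each superpixel; the toy ground truth at the
# centroid serves as the training target.
targets = np.array([truth[r, c] for r, c in centres])
clf = SVC(kernel="rbf").fit(features, targets)
predicted = clf.predict(features)

# Paint each superpixel with its predicted class to get a layout map.
layout = predicted[np.searchsorted(np.unique(segments), segments)]
```

The classifier here is fitted and applied on the same superpixels, so the result only demonstrates the data flow and says nothing about accuracy; a real experiment would train on labeled pages and evaluate on held-out ones.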
Authors
张西群
马龙龙
段立娟
刘泽宇
吴健
ZHANG Xiqun; MA Longlong; DUAN Lijuan; LIU Zeyu; WU Jian (Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China; Beijing Key Laboratory of Trusted Computing, Beijing 100124, China; Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100124, China)
Source
《中文信息学报》
CSCD
PKU Core Journal (北大核心)
2018, No. 7, pp. 67-73, 81 (8 pages)
Journal of Chinese Information Processing
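Funding
Key Technology Platform for the Digitization and Sharing of Tibetan Historical Documents (2016-ZJ-Y04)
Qinghai Province Basic Research Program (2016-ZJ-740)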
Keywords
historical Tibetan documents
layout analysis
convolutional denoising autoencoder
superpixel