Abstract
Segmenting and extracting Manchu words from large Manchu document images is a key step in Manchu document analysis. This paper proposes a Manchu word segmentation and extraction method based on seam carving. First, a projection profile matching strategy is used to perform preliminary smearing and to determine the number of text columns. Then, dynamic programming is run from bottom to top between adjacent text columns to find the minimum-energy line, and a midline region constraint is applied to obtain the best segmentation line, one that does not cut through Manchu character components. Finally, the individual Manchu text columns are extracted along the segmentation lines, and the Manchu words are extracted from those columns. Experimental results show that the method achieves good segmentation and extraction performance on a Manchu document image database.
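The dynamic programming step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a precomputed per-pixel energy map (the abstract does not define the energy function) and models the midline region constraint simply as a band of allowed x-coordinates between two adjacent text columns. The function name `find_vertical_seam` and the parameters `left` and `right` are hypothetical.

```python
def find_vertical_seam(energy, left, right):
    """Return one x-coordinate per row for a minimum-energy vertical seam.

    energy: 2D list (rows x columns) of non-negative pixel energies (assumed given).
    left, right: inclusive x-range the seam may occupy; a stand-in for the
    midline region constraint (an assumption, not the paper's exact rule).
    """
    height = len(energy)
    width = right - left + 1
    inf = float("inf")

    # cost[y][i]: minimum cumulative energy of a path from the bottom row
    # up to pixel (y, left + i). step[y][i]: offset to the chosen pixel in row y + 1.
    cost = [[inf] * width for _ in range(height)]
    step = [[0] * width for _ in range(height)]

    # Initialise the bottom row, then accumulate upward (bottom to top).
    for i in range(width):
        cost[height - 1][i] = energy[height - 1][left + i]
    for y in range(height - 2, -1, -1):
        for i in range(width):
            best, best_d = inf, 0
            for d in (0, -1, 1):  # prefer a straight move on ties
                j = i + d
                if 0 <= j < width and cost[y + 1][j] < best:
                    best, best_d = cost[y + 1][j], d
            cost[y][i] = energy[y][left + i] + best
            step[y][i] = best_d

    # Pick the cheapest end point in the top row and follow the stored steps down.
    i = min(range(width), key=lambda k: cost[0][k])
    seam = []
    for y in range(height):
        seam.append(left + i)
        i += step[y][i]
    return seam


# Toy energy map: columns 0 and 3 stand for ink, columns 1 and 2 for the gap.
toy_energy = [
    [9, 0, 5, 9],
    [9, 5, 0, 9],
    [9, 0, 5, 9],
]
toy_seam = find_vertical_seam(toy_energy, left=1, right=2)
```

Restricting the search to a band keeps the seam from drifting into a neighbouring column, which is the role the abstract assigns to the midline region constraint. How the band is derived from the projection profile is left open here.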
Authors
张晶
许爽
贺建军
李敏
郑蕊蕊
ZHANG Jing; XU Shuang; HE Jianjun; LI Min; ZHENG Ruirui (College of Information and Communication Engineering, Dalian Minzu University, Dalian, Liaoning 116600, China; College of Science, Minzu University of China, Beijing 100081, China)
Source
《中文信息学报》
CSCD
Peking University Core Journal
2019, Issue 2, pp. 81-88 (8 pages)
Journal of Chinese Information Processing
Funding
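National Natural Science Foundation of China (61503058, 61702081)
Natural Science Foundation of Liaoning Province (201602190)
Natural Science Foundation of Liaoning Province, Guidance Program (201602205)
Scientific Research Project of the Education Department of Liaoning Province (L2015127)
Dalian Youth Science and Technology Star Project (2016RQ072)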
Keywords
Manchu document images
seam carving
text column segmentation
projection profile matching
midline region constraint