Abstract
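Layout reconstruction-based document image retrieval methods demand high image quality and are limited to certain languages, while layout segmentation-based methods are constrained by the underlying segmentation techniques. To overcome these problems, a retrieval method based on multi-density (layered density) features of binary document images is proposed. The method obtains the effective text region through preprocessing such as skew correction and black-border removal, extracts the aspect ratio and multi-density features of that region, and retrieves documents by feature comparison. Experiments show that the method adapts to different resolutions and different input devices, is robust to complex layouts and to noise such as annotations, and has a miss rate of 2%, making it a simple and effective document image retrieval method.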
The growth of document image databases poses new challenges for document image retrieval techniques. Traditional layout reconstruction-based methods rely on high-quality document images and can handle only a few widely used languages. The complexity of document layouts greatly hinders layout analysis-based approaches. This paper describes a multi-density feature-based algorithm for binary document images that is independent of optical character recognition (OCR) and layout analysis. The text area is extracted after preprocessing, which includes skew correction and marginal noise removal. The aspect ratio and multi-density features are then extracted from the text area and used to select the best candidates from the document image database. Experimental results show that this approach is simple, with miss rates of less than 2%, and that it handles images of different resolutions from different input devices. The system is also robust to complex layouts and to noise such as handwritten notes.
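The pipeline described in the abstract (text-area extraction, aspect ratio plus density features, feature comparison) can be illustrated with a minimal sketch. The abstract does not define how the density layers are built, so everything specific here is an assumption: the layers are taken to be 1x1, 2x2, and 4x4 grids of ink density, a bounding-box crop stands in for skew correction and border removal, and the names `crop_to_text_area`, `density_features`, `describe`, `distance`, and the tolerance `aspect_tol` are hypothetical.

```python
from typing import List, Sequence, Tuple

Image = Sequence[Sequence[int]]  # binary image: 1 = ink, 0 = background
Descriptor = Tuple[float, List[float]]


def crop_to_text_area(img: Image) -> List[List[int]]:
    """Crop to the bounding box of all ink pixels.

    Assumption: a simple stand-in for the skew correction and
    border removal mentioned in the abstract.
    """
    rows = [r for r, row in enumerate(img) if any(row)]
    cols = [c for c in range(len(img[0])) if any(row[c] for row in img)]
    if not rows:  # blank page: nothing to crop
        return [list(row) for row in img]
    return [list(row[cols[0]:cols[-1] + 1]) for row in img[rows[0]:rows[-1] + 1]]


def density_features(img: Image, levels: Sequence[int] = (1, 2, 4)) -> List[float]:
    """For each level n, split the area into an n x n grid and
    record the fraction of ink pixels in every cell (assumed layering)."""
    h, w = len(img), len(img[0])
    feats: List[float] = []
    for n in levels:
        for i in range(n):
            r0, r1 = i * h // n, (i + 1) * h // n
            for j in range(n):
                c0, c1 = j * w // n, (j + 1) * w // n
                area = (r1 - r0) * (c1 - c0)
                ink = sum(sum(row[c0:c1]) for row in img[r0:r1])
                feats.append(ink / area if area else 0.0)
    return feats


def describe(img: Image) -> Descriptor:
    """Return (aspect ratio, density features) of the cropped text area."""
    area = crop_to_text_area(img)
    return len(area[0]) / len(area), density_features(area)


def distance(a: Descriptor, b: Descriptor, aspect_tol: float = 0.1) -> float:
    """Infinity if the aspect ratios differ by more than aspect_tol
    (relative), otherwise the mean absolute density difference."""
    (ra, fa), (rb, fb) = a, b
    if abs(ra - rb) / max(ra, rb) > aspect_tol:
        return float("inf")
    return sum(abs(x - y) for x, y in zip(fa, fb)) / len(fa)
```

Because each feature is a ratio of ink pixels to cell area, the descriptor in this sketch is unchanged when the same page is scanned at an integer multiple of the resolution, which is one plausible reading of the resolution adaptivity claimed in the abstract. Retrieval would then rank database entries by `distance` to the query descriptor.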
Source
《清华大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core (北大核心)
2006, Issue 7, pp. 1231-1234 (4 pages)
Journal of Tsinghua University(Science and Technology)
Funding
National Natural Science Foundation of China (60472028)
Doctoral Program Foundation of the Ministry of Education of China (20040003015)
Keywords
document image (文档图像)
image retrieval (图像检索)
skew correction (倾斜校正)
multi-density features (分层密度特征)