Abstract
A method is presented for detecting duplicate copies among image documents that arrive in batches, based on cluster analysis. First, each single-page document that has been read into the computer is converted to a monochrome (binary) bitmap. A series of non-overlapping concentric rings is then defined, with their common center computed from the edges of the page, and the radial pixel densities (the number of "on" pixels in each annulus) are taken as the feature vector of the image. A distance is defined between the feature vectors of pages, and cluster analysis is applied to identify the duplicates. In repeated MATLAB simulation experiments on batches of image documents downloaded from the internet, the correct identification rate for single-page documents reached 85%~98%.
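The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation. Several details are assumptions, because the abstract does not specify them: the center is taken as the midpoint of the bitmap (the paper computes it from the page edges), the rings have equal width, the distance is plain Euclidean, and the clustering is a simple single-linkage grouping under a threshold. The names `radial_densities`, `distance`, and `cluster` are hypothetical.

```python
import math

def radial_densities(bitmap, n_rings):
    """Count 'on' pixels (truthy values) in each of n_rings concentric annuli."""
    h, w = len(bitmap), len(bitmap[0])
    # Assumption: the center is the bitmap midpoint; the paper derives it from the page edges.
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r_max = math.hypot(cy, cx) or 1.0  # distance from center to a corner
    counts = [0] * n_rings
    for y, row in enumerate(bitmap):
        for x, value in enumerate(row):
            if value:
                r = math.hypot(y - cy, x - cx)
                # Equal-width rings (assumption); the outermost ring absorbs r == r_max.
                k = min(int(r / r_max * n_rings), n_rings - 1)
                counts[k] += 1
    return counts

def distance(u, v):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cluster(features, threshold):
    """Single-linkage grouping: pages closer than threshold share a label."""
    parent = list(range(len(features)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if distance(features[i], features[j]) < threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(features))]

# Toy 5x5 pages: page_a and page_b are identical, page_c is entirely 'on'.
page_a = [[0] * 5 for _ in range(5)]
page_a[2][2] = 1  # center pixel
page_a[0][0] = 1  # corner pixel
page_b = [row[:] for row in page_a]
page_c = [[1] * 5 for _ in range(5)]

features = [radial_densities(p, 2) for p in (page_a, page_b, page_c)]
labels = cluster(features, threshold=1.0)
```

In this toy run, pages that receive the same label are treated as duplicates of one another. The threshold and the number of rings are illustrative values, not parameters taken from the paper.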
Source
《四川大学学报(自然科学版)》
CAS
CSCD
Peking University Core Journals (北大核心)
2003, No. 1, pp. 36-40 (5 pages)
Journal of Sichuan University(Natural Science Edition)