Abstract
Removing duplicate Web pages improves search-engine precision and reduces data storage requirements. Current text de-duplication algorithms rely mainly on keyword matching or semantic fingerprints, and they are prone to misjudgment when applied to Web pages. In this paper, each character is mapped to a numeric value by a Karhunen-Loeve (K-L) expansion of the character relationship matrix, so that each document becomes a sequence of discrete values. A discrete Fourier transform of this sequence yields a vector of Fourier coefficients for each Web page, and the similarity between two pages is judged by comparing their coefficient vectors. Experimental results show that the method de-duplicates Web pages effectively.
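The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the character relationship matrix is built or how many coefficients are kept, so everything below is an assumption: the matrix is taken to be adjacent-character co-occurrence counts, the K-L expansion is reduced to a projection onto the leading eigenvector (found by shifted power iteration), the first `k` DFT magnitudes serve as the fingerprint, and Euclidean distance stands in for the comparison. All function names are hypothetical.

```python
import cmath
import math


def cooccurrence_matrix(texts):
    """Symmetric adjacent-character co-occurrence counts.
    Assumed stand-in for the paper's character relationship matrix."""
    chars = sorted({c for t in texts for c in t})
    index = {c: i for i, c in enumerate(chars)}
    n = len(chars)
    m = [[0.0] * n for _ in range(n)]
    for t in texts:
        for a, b in zip(t, t[1:]):
            i, j = index[a], index[b]
            m[i][j] += 1.0
            m[j][i] += 1.0
    return index, m


def leading_eigenvector(m, iters=200):
    """Approximate the eigenvector of the largest eigenvalue by power iteration.
    The matrix is shifted by its largest row sum so the iteration cannot
    oscillate between eigenvalues of opposite sign."""
    n = len(m)
    shift = max(sum(row) for row in m)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(n)) + shift * v[i]
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def dft_magnitudes(seq, k):
    """Magnitudes of the first k DFT coefficients, scaled by sequence length."""
    n = len(seq)
    if n == 0:
        return [0.0] * k
    out = []
    for f in range(k):
        s = sum(x * cmath.exp(-2j * math.pi * f * t / n)
                for t, x in enumerate(seq))
        out.append(abs(s) / n)
    return out


def fingerprint(text, index, vec, k=8):
    """Map each known character to its eigenvector component, then take the DFT."""
    seq = [vec[index[c]] for c in text if c in index]
    return dft_magnitudes(seq, k)


def distance(a, b):
    """Euclidean distance between two coefficient vectors (smaller = more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical example pages; the second is a near-duplicate of the first.
pages = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog!",
    "fourier coefficients summarize a numeric sequence",
]
index, matrix = cooccurrence_matrix(pages)
vec = leading_eigenvector(matrix)
fps = [fingerprint(p, index, vec) for p in pages]
print(distance(fps[0], fps[1]), distance(fps[0], fps[2]))
```

Keeping only a few low-order coefficients is what gives the dimension reduction named in the keywords: pages of any length are compared through vectors of fixed size `k`. A threshold on the distance would then decide whether two pages count as duplicates; its value is not given in the abstract.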
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journals (北大核心)
2008, No. 4, pp. 948-950 (3 pages)
Keywords
duplicate removal of Web pages
Karhunen-Loeve (K-L) transform
Fourier transform
dimension reduction