Abstract
Based on the characteristics of duplicated Web pages and the structural features of page body text, this paper proposes a dynamic, hierarchical, and highly robust algorithm for Web page deduplication. The method represents the body text of a page as a text structure tree, on which it implements a dynamic feature extraction algorithm and a similarity computation algorithm based on layer fingerprints. Feature extraction relies on a long-sentence extraction algorithm, which ensures strong robustness. Experiments show that the method accurately detects both mirrored and near-mirrored Web pages.
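The abstract names the building blocks (long-sentence extraction, per-level fingerprints, similarity computation) without giving their details. The following is only a minimal sketch of how such a pipeline could look, not the paper's algorithm. Everything in it is an assumption made for illustration: paragraphs separated by blank lines stand in for the nodes of a text structure tree, the `k` longest sentences of each block serve as features, MD5 digests serve as fingerprints, and Jaccard overlap serves as the similarity measure. All function names are hypothetical.

```python
import hashlib
import re


def long_sentences(text, k=3):
    """Return the k longest sentences of a text block (assumed feature choice)."""
    parts = [s.strip() for s in re.split(r"[。!?!?.]+", text)]
    parts = [s for s in parts if s]
    # sorted() is stable, so equally long sentences keep their original order.
    return sorted(parts, key=len, reverse=True)[:k]


def block_fingerprint(text, k=3):
    """Hash each long sentence, ignoring whitespace; return a set of digests."""
    return {
        hashlib.md5(re.sub(r"\s+", "", s).encode("utf-8")).hexdigest()
        for s in long_sentences(text, k)
    }


def page_fingerprint(page_text, k=3):
    """One fingerprint per paragraph: a flat stand-in for a structure tree."""
    blocks = [b for b in page_text.split("\n\n") if b.strip()]
    return [block_fingerprint(b, k) for b in blocks]


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _directed(fp_a, fp_b):
    """Average, over blocks of A, of the best Jaccard match among blocks of B."""
    return sum(max(jaccard(a, b) for b in fp_b) for a in fp_a) / len(fp_a)


def page_similarity(fp_a, fp_b):
    """Symmetric similarity in [0, 1]: mean of both directed scores."""
    if not fp_a or not fp_b:
        return 0.0
    return (_directed(fp_a, fp_b) + _directed(fp_b, fp_a)) / 2
```

Because only the longest sentences of each block are hashed, short additions such as navigation links or share buttons do not change the fingerprint, which is the intuition behind the robustness claim. A pair of pages would be compared with `page_similarity(page_fingerprint(a), page_fingerprint(b))` and flagged as duplicates above a chosen threshold; the threshold itself is another free parameter of this sketch.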
Source
《计算机应用研究》
CSCD
PKU Core Journal (北大核心)
2010, Issue 7, pp. 2489-2491, 2497 (4 pages in total)
Application Research of Computers
Funding
Supported by the Natural Science Foundation of Chongqing (CSTC2007BB3169)
Keywords
detection and elimination of similar Web pages
text structure tree
extraction of long sentences
layer fingerprint