Abstract
With the rapid growth in the number of web pages on the Internet, reposting of content has become very common. As one of the key problems in improving the quality of search engine services, web page deduplication has become the most important step in web page information processing. Building on a study of traditional web page deduplication techniques, and targeting the structural features of web page body text, this paper proposes the core idea of a web page deduplication algorithm based on key long sentences and preliminary classification by body text length. Experiments show that the algorithm achieves high recall and accuracy, and that it has good application prospects and high research value for filtering duplicate web pages.
Source
《软件导刊》
2012, Issue 10, pp. 48-50 (3 pages)
Software Guide
Keywords
Web Page Deduplication
Key Long Sentence
Preliminary Classification