Abstract
Most Web pages contain noisy blocks such as navigation panels, copyright notices, and advertisements, which reduce the accuracy of Web application systems. Eliminating noisy content quickly and accurately is therefore a key technique for improving the service quality of such systems. This paper proposes an approach to reducing the noise content in Web pages. It uses a tree structure, called a pattern tree (PT), to capture the common layout of the pages in a given Web site, and introduces an entropy-based measure on the nodes of the PT to identify and remove the noisy blocks of the site. The approach is applied in an SVM-based Web page classification system; the results show that purified pages yield some improvement in both the accuracy and the efficiency of classification.
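The abstract does not give the entropy measure itself, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes that each page has already been reduced to a mapping from a layout position (a hypothetical node path) to the text found there, and that blocks whose text barely varies across the pages of a site (low entropy) are noise. The function names, the normalization, and the threshold are all illustrative assumptions.

```python
import math
from collections import Counter


def node_entropy(contents):
    """Normalized Shannon entropy (0..1) of the texts seen at one layout position.

    Assumption: 0 means every page shows the same text at this position,
    1 means every page shows a different text.
    """
    total = len(contents)
    counts = Counter(contents)
    if total <= 1 or len(counts) == 1:
        return 0.0
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(total)


def find_noisy_nodes(pages, threshold=0.3):
    """Return the node paths whose content entropy is below the threshold.

    pages: list of dicts mapping a node path (hypothetical key) to its text.
    threshold: illustrative cut-off, not a value taken from the paper.
    """
    by_path = {}
    for page in pages:
        for path, text in page.items():
            by_path.setdefault(path, []).append(text)
    return {
        path
        for path, texts in by_path.items()
        # A position seen on only one page gives no evidence either way.
        if len(texts) > 1 and node_entropy(texts) < threshold
    }


# Three toy pages from one site: shared navigation and footer, distinct bodies.
pages = [
    {"/body/div[1]": "Home | News | About", "/body/div[2]": "Article A", "/body/div[3]": "Copyright 2007"},
    {"/body/div[1]": "Home | News | About", "/body/div[2]": "Article B", "/body/div[3]": "Copyright 2007"},
    {"/body/div[1]": "Home | News | About", "/body/div[2]": "Article C", "/body/div[3]": "Copyright 2007"},
]
noisy = find_noisy_nodes(pages)
```

In this toy input the navigation and footer positions carry identical text on every page, so they fall below the threshold, while the article body varies and is kept. A purification step would drop the flagged positions before the page is passed to a classifier.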
Source
《计算机工程》
CAS
CSCD
Peking University Core Journals (北大核心)
2007, Issue 8, pp. 89-91 (3 pages)
Computer Engineering
Keywords
Document tree
Pattern tree
Element node
Style node
Web page purification