Abstract
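The layout of a Web page can be divided into main content, organizational identity, navigation information, interactive information, and the copyright notice. When processing such pages we usually care only about the main content, and a human reader can quickly locate it from the semantics, but this is very difficult for a software system. This paper proposes a tag-tree-based method for partitioning and searching the regions of a Web page, so that a software system can ignore the other regions and quickly locate the main content. When large numbers of Web pages are processed, the method reduces processing time and storage space, and the more pages there are, the more pronounced the effect.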
A Web page can be divided into several parts: the main part, the department logo, the navigation bar, the hyperlinks, and the copyright notice. Identifying the main part of a Web page is easy for a human reader but hard for a computer. In this paper we tackle the problem with a tag tree, which suitably expresses the structure and layout of a Web page. We propose a method for building the tag tree and, from it, develop a single-path tag tree, called the tag tree model, that describes only the main part of a Web page.
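The abstract does not give the construction details, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It builds a tag tree with Python's standard `html.parser` and then follows, at each level, the child that holds most of the text, producing a single root-to-content path. The names `Node`, `TagTreeBuilder`, `build_tag_tree`, `main_content_path`, and the text-share `threshold` heuristic are all assumptions introduced for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tag tree (hypothetical structure, not from the paper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_len = 0  # characters of text directly inside this element

    def total_text(self):
        # Text in this element plus all of its descendants.
        return self.text_len + sum(c.total_text() for c in self.children)


class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open ancestor with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text_len += len(data.strip())


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def main_content_path(root, threshold=0.5):
    """Assumed heuristic: descend into the child that holds at least
    `threshold` of the current node's text; stop when no child dominates."""
    path = [root.tag]
    node = root
    while node.children:
        best = max(node.children, key=Node.total_text)
        total = node.total_text()
        if total == 0 or best.total_text() / total < threshold:
            break
        node = best
        path.append(node.tag)
    return path


page = ("<html><body>"
        "<div id='nav'><a>Home</a><a>News</a></div>"
        "<div id='main'><p>" + "x" * 200 + "</p></div>"
        "<div id='footer'>Copyright</div>"
        "</body></html>")
print(main_content_path(build_tag_tree(page)))
# prints ['document', 'html', 'body', 'div', 'p']
```

A real page would need more than text length to separate content from navigation (for example link density or script filtering); the sketch only shows how a single path through the tag tree can stand in for the main region.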
Source
《计算机科学》
CSCD
Peking University Core Journals (北大核心)
2005, Issue 8, pp. 182-185 (4 pages)
Computer Science