Abstract
With the growth of the Internet, Web pages carry ever more information at ever higher density. Most Web pages contain several information blocks that are laid out compactly and follow similar patterns in their HTML markup. This paper proposes an information extraction method for Web pages that contain multiple information blocks. First, an extended DOM (Document Object Model) tree is built and the page is broken into discrete pieces of information. Next, these discrete pieces are regrouped into information blocks according to the hierarchy of the extended DOM tree, combined with the necessary visual features and semantic information. Finally, the subtrees that contain the information blocks are identified, and the information is extracted by a depth-first traversal of the extended DOM tree. The algorithm is applicable to Web pages containing multiple information blocks.
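The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not define the extended DOM tree, so the sketch assumes a plain tag tree built with the standard-library `html.parser`, and it replaces the visual and semantic cues with a single structural cue (sibling subtrees that share the same tag shape). All names (`Node`, `TreeBuilder`, `signature`, `find_block_container`, `extract_blocks`, `SAMPLE`) are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset, enough for the demo).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element of a simplified DOM tree (hypothetical structure)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text found directly inside this element


class TreeBuilder(HTMLParser):
    """Step 1 (simplified): build a tag tree from the page."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text += data.strip() + " "


def signature(node):
    """Tag-only shape of a subtree, e.g. 'li(a,span)'."""
    if not node.children:
        return node.tag
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def find_block_container(root, min_blocks=2):
    """Step 2 (structural cue only): find the node whose children repeat
    the same subtree shape most often."""
    best = None
    best_count = 0
    for node in iter_nodes(root):
        counts = Counter(signature(c) for c in node.children if c.children)
        if not counts:
            continue
        sig, count = counts.most_common(1)[0]
        if count >= min_blocks and count > best_count:
            best, best_count = (node, sig), count
    return best


def collect_text(node):
    """Step 3: depth-first traversal that gathers the text of one block."""
    parts = [node.text.strip()] if node.text.strip() else []
    for child in node.children:
        parts.extend(collect_text(child))
    return parts


def extract_blocks(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    found = find_block_container(builder.root)
    if found is None:
        return []
    container, sig = found
    return [collect_text(c) for c in container.children if signature(c) == sig]


SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a></div>
<ul>
<li><a href="/1">Book A</a><span>$10</span></li>
<li><a href="/2">Book B</a><span>$12</span></li>
<li><a href="/3">Book C</a><span>$9</span></li>
</ul>
</body></html>
"""

if __name__ == "__main__":
    for block in extract_blocks(SAMPLE):
        print(block)
```

Exact matching of tag shapes is deliberately strict; real pages would likely need a tolerant similarity measure plus the layout and semantic cues the abstract mentions.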
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
PKU Core Journal (北大核心)
2007, Issue 6, pp. 137-139 (3 pages)
Keywords
DOM tree
Information extraction
Wrapper
Semi-structured