Abstract
Many current Web information extraction techniques rely on HTML structure. Because that structure changes frequently, the extraction template (wrapper) must be updated often, and each update requires considerable domain knowledge. This paper proposes a Web information extraction method based on a classification algorithm: the text of a Web page is grouped by its display attributes, the text is classified on the basis of the display attribute values, and the text of interest is selected, which completes the extraction of information from the page. The method is simple to operate, easy to implement, and depends little on the structure of the page. Experiments show that it performs well.
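The idea of classifying page text by display attributes rather than by position in the HTML tree can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract names neither the display attributes nor the classifier, so the attribute vector used here (font size, font color, bold), the nearest-match rule, and all class and function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Assumed display attributes: font size, font color, bold.
BOLD_TAGS = {"b", "strong", "h1", "h2", "h3"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class DisplayAttributeParser(HTMLParser):
    """Collect each text fragment with the display attributes in effect."""

    def __init__(self):
        super().__init__()
        self.stack = []      # one (tag, attrs) pair per open element
        self.fragments = []  # (text, attribute_vector) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.fragments.append((text, self.attribute_vector()))

    def attribute_vector(self):
        size, color, bold = "", "", False
        for tag, attrs in self.stack:  # inner elements override outer ones
            if tag == "font":
                size = attrs.get("size") or size
                color = attrs.get("color") or color
            if tag in BOLD_TAGS:
                bold = True
        return (size, color, bold)


def extract_fragments(html):
    parser = DisplayAttributeParser()
    parser.feed(html)
    parser.close()
    return parser.fragments


def train(labelled):
    """Map each attribute vector seen in training to its majority label."""
    votes = defaultdict(Counter)
    for vector, label in labelled:
        votes[vector][label] += 1
    return {v: counts.most_common(1)[0][0] for v, counts in votes.items()}


def classify(model, vector):
    """Return the label of the training vector differing in the fewest attributes."""
    def distance(known):
        return sum(a != b for a, b in zip(known, vector))
    return model[min(model, key=distance)]


# Hand-labelled training page (hypothetical data).
train_html = ('<font size="5" color="red"><b>Title A</b></font>'
              '<font size="2">Ad text</font>')
vectors = [vector for _, vector in extract_fragments(train_html)]
model = train(zip(vectors, ["target", "noise"]))

# A page with a different tag structure but the same display attributes.
new_html = ('<div><font size="5" color="red"><b>Title B</b></font>'
            '<p><font size="2">More ads</font></p></div>')
extracted = [text for text, vector in extract_fragments(new_html)
             if classify(model, vector) == "target"]
print(extracted)  # ['Title B']
```

The second page wraps its text in extra `div` and `p` elements, yet the fragment is still recovered, because the decision depends only on the display attribute vector and not on the path through the HTML tree.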
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2008, Issue 3, pp. 91-93 (3 pages)
Funding
National 242 Fund (project numbers: 2005B22, 2006B20)
Keywords
Web information extraction, attribute vector, wrapper, display attributes