Abstract
Based on an analysis of structured and semi-structured information in web pages, this paper proposes a rule-model-based method for extracting the main text of a web page. The method summarizes how different HTML tags are used and how page layouts are structured, then defines a series of filtering, extraction, and merging rules to build a general-purpose main-text extraction model that effectively extracts a page's topical text. Experimental results show that the method achieves high accuracy in extracting the main text from various types of web pages and generalizes well.
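The abstract names three kinds of rules (filtering, extraction, merging) but does not list them. The following is a minimal sketch of how such a pipeline could be arranged, not the authors' model. The tag sets `FILTER_TAGS` and `BLOCK_TAGS`, the link-density heuristic, and the thresholds `min_chars` and `max_link_density` are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Hypothetical rule sets: the abstract does not give the paper's actual rules.
FILTER_TAGS = {"script", "style", "noscript", "iframe", "form", "nav", "footer"}
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section", "h1", "h2", "h3", "br"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the link text in each block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []       # finished blocks as (text, link_chars)
        self._parts = []       # text fragments of the block being built
        self._link_chars = 0   # characters of the current block inside <a>
        self._skip_depth = 0   # > 0 while inside a filtered tag
        self._link_depth = 0   # > 0 while inside <a>

    def flush_block(self):
        """Close the current block, normalizing whitespace."""
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in FILTER_TAGS:          # filtering rule: drop the whole subtree
            self.flush_block()
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:         # extraction rule: block tags split text
            self.flush_block()
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag in FILTER_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush_block()
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._parts.append(data)
        if self._link_depth:
            self._link_chars += len(data.strip())


def extract_main_text(html, min_chars=20, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by links, then merge them."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush_block()
    kept = [
        text
        for text, link_chars in collector.blocks
        if len(text) >= min_chars and link_chars / len(text) <= max_link_density
    ]
    return "\n".join(kept)             # merging rule: join surviving blocks in order
```

In this sketch, navigation bars tend to be dropped because nearly all of their text sits inside links, while short fragments are dropped by the length threshold. Both thresholds would need tuning against real pages.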
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
Peking University Core Journals (北大核心)
2009, No. 20, pp. 4665-4667 (3 pages)
Keywords
rule model
web information extraction
main body extraction
data gathering
web mining