Abstract
Metadata of HTML documents can support a wider variety of retrieval methods for Web search. This paper proposes a method for automatically extracting document metadata from HTML documents, focusing on the design of the extraction rules, the rule induction algorithm, and the analysis of its complexity. The extraction rules are syntactically close to document fragments, which makes them well suited to automatic generation. Because the rules are produced by automatic induction, no manual analysis is required, which fits the characteristics of Web documents. Experimental results are presented and analyzed at the end of the paper.
Source
《模式识别与人工智能》
EI
CSCD
PKU Core (Peking University core journal list)
2005, No. 4, pp. 405-411 (7 pages)
Pattern Recognition and Artificial Intelligence
Keywords
Metadata extraction
Rule-based
Automatic induction