基于自动规约规则的HTML文档元数据提取 (cited by: 3)

Metadata Extracting for HTML Document Based on Automatic Inducted Rules
Abstract: The metadata of HTML documents can support a variety of retrieval methods for Web search. This paper proposes a method for automatically extracting document metadata from HTML documents, with emphasis on the design of the extraction rules, the rule induction algorithm, and its complexity analysis. The extraction rules are syntactically close to document fragments, which makes them well suited to automatic generation. Because the rules are produced by automatic induction, no manual analysis is required, which fits the characteristics of Web documents. Experimental results are presented and analyzed at the end of the paper.
Source: Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》), indexed in EI, CSCD, and the Peking University Core Journals list; 2005, No. 4, pp. 405-411 (7 pages)
Keywords: Metadata Extraction, Rule-Based, Automatic Induction
References (9)

  • 1 Wang Ye, Wang Jicheng, Zhang Fuyan. Research on Metadata-Based Web Information Retrieval [J]. Journal of the China Society for Scientific and Technical Information (情报学报), 2001, 20(3): 309-316. (In Chinese; cited by: 14)
  • 2 Freitag D. Information Extraction from HTML: Application of a General Machine Learning Approach. In: Proc of the 15th National Conference on Artificial Intelligence. Madison, USA, 1998, 517-523.
  • 3 Kushmerick N, Thomas B. Adaptive Information Extraction: Core Technologies for Information Agents. In: Klusch M, Bergamaschi S, Edwards P, Petta P, eds. Intelligent Information Agents R&D in Europe: The AgentLink Perspective. 2002. http://citeseer.ist.psu.edu/kushmerick02adaptive.html
  • 4 Hobbs J. The Generic Information Extraction System. In: Proc of the 5th Message Understanding Conference. San Francisco, USA: Morgan Kaufmann, 1993, 87-91.
  • 5 Kushmerick N. Wrapper Induction: Efficiency and Expressiveness. Artificial Intelligence, 2000, 118(1-2): 15-68.
  • 6 Soderland S. Learning Information Extraction Rules for Semi-Structured and Free Text. Machine Learning, 1999, 34(1-3): 233-272.
  • 7 Hsu C N, Dung M T. Generating Finite-State Transducers for Semistructured Data Extraction from the Web. Information Systems, 1998, 23(8): 521-538.
  • 8 Soderland S. Learning Text Analysis Rules for Domain-Specific Natural Language Processing. Ph.D. Dissertation. University of Massachusetts, Amherst, USA, 1997.
  • 9 Di Di, Zhou Jingyang, Pan Jingui. Rule-Based Metadata Extraction from HTML Documents [J]. Computer Engineering (计算机工程), 2004, 30(9): 85-86. (In Chinese; cited by: 7)

Second-level references (6)

  • 1 Automation Development Department, Beijing Library. China MARC Format (中国机读目录通讯格式). 1991. (In Chinese)
  • 2 Raggett D, Hors A L, Jacobs I. HTML 4.0 Specification. http://www.w3.org/TR/1998/REC-html40-19980424/
  • 3 Kobayashi M, Takeda K. Information Retrieval on the Web. ACM Computing Surveys, 2000, 32(2): 144-173.
  • 4 http://dublincore.org/
  • 5 Dublin Core Metadata Initiative. Dublin Core Metadata Element Set, Version 1.1: Reference Description. http://dublincore.org/documents/1999/07/02/dces/
  • 6 Wang Ye, Wang Jicheng, Zhang Fuyan. Research on Metadata-Based Web Information Retrieval [J]. Journal of the China Society for Scientific and Technical Information (情报学报), 2001, 20(3): 309-316. (In Chinese; cited by: 14)

Co-citing documents (19)

Co-cited documents (28)

Citing documents (3)

Second-level citing documents (36)
