Abstract
With the rapid growth of Web information, information extraction (IE) techniques are well suited to automatically extracting facts of interest from large collections of Web documents. This paper presents the design and implementation of a rule-induction-based IE system that automates Web information retrieval through Document Object Model (DOM) parsing and the definition of rules for retrieval, extraction, and mapping. Within this rule-induction framework, the authors focus on an experimental comparison of the WHISK learning algorithm, which is used to generate extraction patterns. The experimental results show that the system performs well on both single-slot and multi-slot extraction tasks.
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journal (北大核心)
2008, Issue 21, pp. 166-170 (5 pages)
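Funding
National Natural Science Foundation of China (Grant No. 60775028)
Major Project of the Dalian Science and Technology Bureau (No. 2007A14GX042)
Open Project of the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University (No. 93K-17-2006-04)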
Keywords
information extraction
extraction rule
DOM
learning algorithm