Abstract
To integrate and query irregular, dynamic information on the Web in a database fashion, this paper uses the Object Exchange Model (OEM) to model Web information. To represent each component of a page as a corresponding OEM object, the paper (1) designs an algorithm that extracts semistructured information from HTML pages; (2) defines a data extraction format that satisfies a set of constraints, together with a candidate algorithm that outputs the correct extraction format; and (3) reports test results. The method can extract both structured and semistructured information and is more general than existing extraction methods.
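As an illustration of what "expressing each component of a page as an OEM object" can look like, the following is a minimal sketch and not the paper's extraction algorithm. It assumes the common OEM convention of an object with an identifier, a label, and either an atomic value or a list of sub-objects; the names `OEMObject`, `OEMBuilder`, `html_to_oem`, and `VOID_TAGS` are hypothetical, and the mapping from HTML tags to labels is chosen here only for demonstration.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from itertools import count
from typing import List, Union

_next_oid = count(1)  # simple object-identifier generator (assumption)

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


@dataclass
class OEMObject:
    """Hypothetical OEM-style object: label plus atomic or complex value."""
    label: str
    value: Union[str, List["OEMObject"]]
    oid: int = field(default_factory=lambda: next(_next_oid))

    @property
    def oem_type(self) -> str:
        # Complex objects hold sub-objects; atomic ones hold a string.
        return "complex" if isinstance(self.value, list) else "string"


class OEMBuilder(HTMLParser):
    """Builds a tree of OEMObject nodes that mirrors the tag nesting."""

    def __init__(self) -> None:
        super().__init__()
        self.root = OEMObject("page", [])
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        obj = OEMObject(tag, [])
        self._stack[-1].value.append(obj)
        if tag not in VOID_TAGS:
            self._stack.append(obj)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if any.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].label == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._stack[-1].value.append(OEMObject("text", text))


def html_to_oem(html: str) -> OEMObject:
    builder = OEMBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


if __name__ == "__main__":
    root = html_to_oem("<ul><li>Title</li><li>Price</li></ul>")
    for item in root.value[0].value:
        print(item.label, item.value[0].value)
```

The sketch only mirrors tag nesting. Deciding which parts of a page form records and which extraction format is correct is the job of the paper's own algorithms, which are not reproduced here.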
Source
《计算机应用与软件》
CSCD
PKU Core Journals (北大核心)
2002, No. 1, pp. 53-59 (7 pages)
Computer Applications and Software