
The Design and Implementation of an Information Extraction Model for Web Tables (cited by: 1)
Abstract: Web tables are a compact and efficient way to present data and are widely used in Web pages. This paper proposes a Web table information extraction model based on table structure. The model consists of three modules: a table locating module, a table structure preprocessing module, and a table information extraction and reconstruction module. It extracts table information according to the structural tags of Web tables and a set of self-defined heuristic rules. Experimental results show that the model can be applied effectively to extracting information from Web tables.
Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2009, No. 4, pp. 72-74 (3 pages)
Funding: National Development and Reform Commission fund project (SNMCFIP-2006S001).
Keywords: table structure; extraction model; heuristic rules; preprocessing; parsing
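
The abstract names three modules (table locating, structure preprocessing, and extraction with reconstruction) but does not spell out the heuristic rules. The following is a minimal Python sketch of how such a pipeline might look, using only the standard library. All names here (`TableCollector`, `expand_spans`, `is_data_table`, `to_records`, `extract_tables`, `SAMPLE_HTML`) and the minimum-size rule for telling data tables from layout tables are illustrative assumptions, not the authors' implementation.

```python
from html.parser import HTMLParser

# Hypothetical input: one layout table and one data table with a rowspan.
SAMPLE_HTML = """
<table><tr><td>site menu</td></tr></table>
<table>
  <tr><th>Name</th><th>Year</th><th>Venue</th></tr>
  <tr><td rowspan="2">A</td><td>2008</td><td>X</td></tr>
  <tr><td>2009</td><td>Y</td></tr>
</table>
"""


def _span(value):
    """Parse a rowspan/colspan attribute, falling back to 1 when malformed."""
    return max(1, int(value)) if value and value.isdigit() else 1


class TableCollector(HTMLParser):
    """Locating step: collect raw cells of every <table> (nesting unsupported)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables: rows of (text, rowspan, colspan)
        self._rows = None  # rows of the table being read, None outside a table
        self._cell = None  # [text_parts, rowspan, colspan] of the open cell

    def _flush_cell(self):
        # Close the open cell, if any, and store it in the current row.
        if self._cell is not None and self._rows:
            parts, rowspan, colspan = self._cell
            text = " ".join("".join(parts).split())  # collapse whitespace
            self._rows[-1].append((text, rowspan, colspan))
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._flush_cell()
            self._rows = []
        elif self._rows is None:
            return
        elif tag == "tr":
            self._flush_cell()
            self._rows.append([])
        elif tag in ("td", "th") and self._rows:
            self._flush_cell()
            attrs = dict(attrs)
            self._cell = [[], _span(attrs.get("rowspan")), _span(attrs.get("colspan"))]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr"):
            self._flush_cell()
        elif tag == "table" and self._rows is not None:
            self._flush_cell()
            self.tables.append(self._rows)
            self._rows = None


def expand_spans(rows):
    """Preprocessing step: copy spanned cells so the table becomes a full grid."""
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # slot already taken by an earlier rowspan
                c += 1
            for dr in range(min(rowspan, len(rows) - r)):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_cols = max(col for _, col in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(len(rows))]


def is_data_table(grid, min_rows=2, min_cols=2):
    """Assumed heuristic: very small tables are treated as page layout."""
    return len(grid) >= min_rows and len(grid[0]) >= min_cols


def to_records(grid):
    """Extraction step: treat the first row as the header (an assumption)."""
    header, *body = grid
    return [dict(zip(header, row)) for row in body]


def extract_tables(html):
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    grids = [expand_spans(rows) for rows in parser.tables]
    return [to_records(g) for g in grids if is_data_table(g)]


if __name__ == "__main__":
    for table in extract_tables(SAMPLE_HTML):
        for record in table:
            print(record)
```

Expanding `rowspan` and `colspan` before extraction lets the last step treat every table as a plain rectangular grid, which keeps the header-to-value mapping simple. A real system would need richer rules, for example for header detection and nested tables.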

