
Automatic extraction of semantic hierarchical structures from HTML tables (自动获取HTML表格语义层次结构方法)
Cited by: 9
Abstract  Existing approaches for extracting information from hypertext markup language (HTML) tables cannot process complex or nested tables. This paper presents a method for automatically extracting the semantic hierarchical structure of HTML tables. The method builds on four basic types of tables and uses a content tree to represent a table's semantic hierarchical structure. It consists of three main steps: (1) identify the attribute cells and value cells of the HTML table; (2) split the table into basic tables; (3) construct a content tree for each resulting basic table to obtain the semantic hierarchical structure. Experimental results show that the method automatically handles nested and complex tables, with low complexity and good accuracy.
Source  Journal of Tsinghua University (Science and Technology) (《清华大学学报(自然科学版)》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list; 2007, No. 10, pp. 1586-1590 (5 pages)
Funding  National "863" High-Tech Program of China (2004AA414020)
Keywords  row-wise table; column-wise table; row-column-wise table; content tree
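The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm, whose details the abstract does not give. It assumes that `<th>` marks an attribute cell and `<td>` a value cell, and it covers only the simplest case, a table whose first row holds all the attribute cells. Splitting a complex or nested table into basic tables (step 2) is omitted. The names `CellCollector`, `build_content_tree`, `extract`, and `SAMPLE` are hypothetical.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect table cells as (text, is_attribute) pairs, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being read
        self._cell = None   # [text_parts, is_attribute] of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            # Assumption: <th> marks an attribute cell, <td> a value cell.
            self._cell = [[], tag == "th"]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            text = "".join(self._cell[0]).strip()
            self._row.append((text, self._cell[1]))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def build_content_tree(rows):
    """Build a nested dict for a table whose first row holds the attributes."""
    header, *records = rows
    if not all(is_attr for _, is_attr in header):
        raise ValueError("first row must contain only attribute cells")
    names = [text for text, _ in header]
    tree = {"table": []}
    for record in records:
        values = [text for text, _ in record]
        # Each data row becomes a subtree mapping attribute -> value.
        tree["table"].append(dict(zip(names, values)))
    return tree


def extract(html):
    """Parse one simple HTML table and return its content tree."""
    parser = CellCollector()
    parser.feed(html)
    return build_content_tree(parser.rows)


SAMPLE = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Pen</td><td>2</td></tr>
  <tr><td>Book</td><td>10</td></tr>
</table>
"""

if __name__ == "__main__":
    print(extract(SAMPLE))
```

Real pages often mark header cells with `<td>` and use `rowspan`, `colspan`, or nested tables, so a practical implementation would need stronger cell classification and the splitting step that this sketch leaves out.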

References (8)

  • 1. Yoshida M, Torisawa K, Tsujii J. Extracting attributes and their values from web pages [C] // Antonacopoulos A, Hu Jianying. Web Document Analysis: Challenges and Opportunities. Singapore: World Scientific Publishing, 2003: 179-200.
  • 2. Lim Seungjin, Ng Yiukai. An automated approach for retrieving hierarchical data from HTML tables [C] // Proceedings of the Eighth International Conference on Information and Knowledge Management. Kansas City: ACM, 1999: 466-474.
  • 3. LIU Jiexue, AO Zhuoyun, Park H H, et al. An XML approach to semantically extract data from HTML tables [C] // Database and Expert Systems Applications, DEXA 2005, Lecture Notes in Computer Science 3588. Heidelberg: Springer Berlin, 2005: 696-705.
  • 4. Kim Yeonseok, Lee Kyongho. Extracting table information from the Web [C] // Document Analysis Systems VI. 6th International Workshop, DAS 2004, Lecture Notes in Computer Science 3163, 2004: 438-441.
  • 5. Tanaka M, Ishida T. Ontology extraction from tables on the web [C] // Proceedings of the International Symposium on Applications on Internet in SAINT-06. Washington: IEEE Computer Society, 2006: 284-290.
  • 6. Hsiao Shuling, Chou Shihchun, Chang Luping. Information extraction from HTML tables base on domain ontology [C] // International Conference on Information and Knowledge Engineering-IKE'03. Las Vegas: CSREA Press, 2003: 70-78.
  • 7. LI Shijun, PENG Zhiyong, LIU Mengchi. Extraction and integration information in HTML tables [C] // Fourth International Conference on Computer and Information Technology. Nanjing, China, 2004: 315-320.
  • 8. Yoshida M, Torisawa K, Tsujii J. Extracting ontologies from world wide web via HTML tables [C] // Proceedings of the Pacific Association for Computational Linguistics. Kitakyushu, Japan, 2001: 332-341.

Co-cited documents: 55

Citing documents: 9

Second-level citing documents: 24
