Abstract
To address the limitations that existing information extraction methods, in China and abroad, exhibit to varying degrees, this paper proposes a Web table information extraction method based on the DOM tree and a binary tree structure. The method provides an extraction tool that takes Web tables as its extraction target and lets the user choose the extraction mode. The tool parses an HTML document into a DOM tree, builds from that DOM tree a binary tree that carries the text content, and finally extracts the Web table information by traversing the binary tree.
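The three steps named in the abstract (parse to a DOM tree, convert to a binary tree that holds text, traverse) can be illustrated with a minimal sketch. The abstract does not say how the binary tree is constructed, so this example assumes the standard left-child/right-sibling encoding, and it assumes well-formed markup with explicit closing tags. All names here (Node, TableTreeBuilder, collect_text, extract_tables) are hypothetical and are not taken from the paper's tool.

```python
from html.parser import HTMLParser


class Node:
    """A DOM node stored in left-child/right-sibling form (an assumed encoding)."""

    def __init__(self, tag, text=""):
        self.tag = tag      # element name, or "#text" for text nodes
        self.text = text    # text content (only set for "#text" nodes)
        self.left = None    # first child in the DOM tree
        self.right = None   # next sibling in the DOM tree
        self.last = None    # last child appended; used only while building


class TableTreeBuilder(HTMLParser):
    """Builds the binary tree directly while the HTML is being parsed."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # chain of currently open elements

    def _append(self, node):
        parent = self.stack[-1]
        if parent.left is None:
            parent.left = node          # first child becomes the left link
        else:
            parent.last.right = node    # later children chain via right links
        parent.last = node

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self._append(node)
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._append(Node("#text", text))


def collect_text(node):
    """Preorder walk below `node`, joining the text of all its descendants."""
    parts, stack = [], [node.left]
    while stack:
        cur = stack.pop()
        if cur is None:
            continue
        if cur.tag == "#text":
            parts.append(cur.text)
        stack.append(cur.right)  # pushed first, so it is visited after left
        stack.append(cur.left)
    return " ".join(parts)


def extract_tables(html):
    """Return a list of tables; each table is a list of rows of cell strings."""
    builder = TableTreeBuilder()
    builder.feed(html)
    builder.close()
    tables, stack = [], [builder.root]
    while stack:  # iterative preorder traversal of the binary tree
        cur = stack.pop()
        if cur is None:
            continue
        if cur.tag == "table":
            tables.append([])
        elif cur.tag == "tr" and tables:
            tables[-1].append([])
        elif cur.tag in ("td", "th") and tables and tables[-1]:
            tables[-1][-1].append(collect_text(cur))
            stack.append(cur.right)  # cell subtree already consumed
            continue
        stack.append(cur.right)
        stack.append(cur.left)
    return tables


if __name__ == "__main__":
    sample = ("<table><tr><th>Name</th><th>Age</th></tr>"
              "<tr><td>Li <b>Ming</b></td><td>30</td></tr></table>")
    print(extract_tables(sample))
```

In this sketch a preorder traversal of the binary tree visits nodes in the same order as a depth-first walk of the original DOM tree, which is why rows and cells come out in document order. A table nested inside a cell is flattened into that cell's text. Handling nested tables separately, or offering different extraction modes as the abstract mentions, would need additional logic that the abstract does not describe.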
Source
Journal of North China Institute of Water Conservancy and Hydroelectric Power (《华北水利水电学院学报》)
2011, No. 3, pp. 108-110 (3 pages)
Funding
Science and Technology Research Project of the Education Department of Henan Province (2011B510008)