Abstract
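This paper proposes a Web table information extraction model based on the amount of effective information. The model consists of two main modules, table positioning and table information extraction. It identifies the theme table from the content features of Web tables, and it segments the table into a value area and an attribute area by checking format and syntactic features. Experimental results show that the model is well suited to extracting information from Web tables.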
A new model based on table structure is proposed for extracting information from tables in Web documents. It is composed of a table positioning module and a table information extraction module. The theme table is identified from the content characteristics of the Web tables. Area segmentation cleans up the tables and segments them into attribute and value areas by checking visual and semantic coherency. The experimental results show that the model performs well in extracting information from tables in Web documents.
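The two stages named in the abstract (locating the theme table, then splitting it into attribute and value areas) can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not define the effective-information measure or the coherency checks, so the sketch assumes a simple proxy (non-whitespace characters of cell text) for scoring tables and a simple heuristic (`<th>` rows, or a first row without numbers) for the attribute area. The names `TableCollector`, `information_score`, `find_theme_table`, and `split_areas` are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect each <table> as a list of rows; a cell is (text, is_header)."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None
        self._is_th = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            self._is_th = tag == "th"

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append(("".join(self._cell).strip(), self._is_th))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without cells
                self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def information_score(table):
    # Assumed proxy for "effective information": count of non-whitespace
    # characters across all cells. The paper's actual measure is not given.
    return sum(len("".join(text.split())) for row in table for text, _ in row)


def find_theme_table(html):
    """Return the highest-scoring table in the page, or None if there is none."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return max(parser.tables, key=information_score, default=None)


def is_number(text):
    try:
        float(text.replace(",", ""))
        return True
    except ValueError:
        return False


def split_areas(table):
    """Split a table into (attribute_rows, value_rows) of plain cell text.

    Assumed heuristic: leading rows made only of <th> cells are attributes;
    a first row with no numeric cell is also treated as an attribute row.
    """
    n = 0
    for row in table:
        all_th = all(is_th for _, is_th in row)
        no_numbers = not any(is_number(text) for text, _ in row)
        if all_th or (n == 0 and no_numbers):
            n += 1
        else:
            break
    texts = [[text for text, _ in row] for row in table]
    return texts[:n], texts[n:]
```

Calling `find_theme_table(page_html)` and passing the result to `split_areas` yields the header rows and data rows separately. A real system would replace both heuristics with the content, format, and syntax features the paper describes.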
Source
《西南师范大学学报(自然科学版)》
CAS
CSCD
Peking University Core Journals
2010, No. 4, pp. 159-163 (5 pages)
Journal of Southwest China Normal University(Natural Science Edition)
Funding
Science and Technology Research Project of the Chongqing Municipal Education Commission (KJ091309)