摘要
网页具有丰富的内容和复杂多变的结构,现有的网页信息提取技术解决了单记录型简单页面的信息提取问题,但是对于多记录型复杂页面的信息提取效果往往不佳。文中提出了一种全新的基于可视块的复杂网页信息自动化提取算法(Visual Block Based Information Extraction,VBIE),通过启发式规则构建可视块与可视块树,然后通过区域聚焦、噪声过滤及可视块筛选,实现了对复杂网页中数据记录的提取。该方法摒弃了以往算法对网页结构的特定假设,无需对HTML文档进行任何人工标记,保留了网页的原始结构,且能够在单页面上实现无监督的信息提取。实验结果表明,VBIE的网页信息提取精确度最高可达100%,在主流搜索引擎的结果页面和社区论坛的帖子页面上的F1均值分别为98.5%和96.1%。相比目前方法中在复杂网页上提取效果较好的CMDR方法,VBIE的F1值提高了近16.3%,证明了该方法能够有效解决复杂网页的信息提取问题。
The webpage has rich content and complicated and varied structure.The existing webpage information extraction technology solves the information extraction of the single-recording simple page,but the information extraction effect of the multi-recording complex page is often poor.This paper proposed a new visual block based information extraction algorithm,named visual block based information extraction (VBIE).By constructing visual blocks and visual block trees,and through heuristic rules,regional focus,noise filtering and visual block filtering,data record extraction is realized for complex webpages.Compared with other existing methods,this method abandons the specific assumptions of the previous algorithm on the structure of the webpage,does not need to manually mark the HTML document,preserves the original structure of the webpage,and can realize unsupervised information extraction on a single page.The experimental results show that VBIE’s webpage information extraction accuracy is up to 100%,and the average value of F1 on the results page of the mainstream search engine and the post page of the community forum are 98.5% and 96.1%.Compared with the current method CMDR,the F1 value of VBIE is improved by nearly 16.3%,which proves that the method can effectively solve the information extraction task of complex webpages.
作者
王卫红
梁朝凯
闵勇
WANG Wei-hong;LIANG Chao-kai;MIN Yong(College of Computer Science and Technology,Zhejiang University of Technology,Hangzhou 310023,China)
出处
《计算机科学》
CSCD
北大核心
2019年第10期63-70,共8页
Computer Science
基金
浙江省自然科学基金(LY17G030030,LGF18D010001,LGF18D010002)资助