Abstract
When applied to large-scale heterogeneous web pages, web information extraction methods based on visual features generally suffer from poor generality and low extraction efficiency. To address the poor generality, this paper proposes WEMLVF, a web information extraction framework based on visual features that uses supervised machine learning. The framework generalizes well, and its effectiveness is validated through information extraction experiments on forum sites and news comment sites. Then, to address the low extraction efficiency caused by the high time cost of extracting visual features, the paper uses WEMLVF to propose two methods for automatically generating information extraction templates, one based on XPath and one based on the classic wrapper induction algorithm SoftMealy. Both methods use visual features to generate the templates automatically, but the templates themselves contain no visual features, so no visual features need to be extracted from a page when a template is applied. The methods thus make full use of visual features in information extraction while significantly improving extraction efficiency, a conclusion that the experimental results confirm.
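The idea of a template that is learned with visual features but applied without them can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names `indexed_path` and `induce_template`, the sample pages, and the generalization rule (drop a positional index wherever the labeled nodes disagree) are all assumptions made for illustration, and the visually trained classifier is replaced by a stand-in that simply labels every `p` element on the training page.

```python
import xml.etree.ElementTree as ET

TRAIN_PAGE = """<html><body>
<div><h1>Thread</h1></div>
<div>
  <div><span>alice</span><p>first post</p></div>
  <div><span>bob</span><p>second post</p></div>
</div>
</body></html>"""

NEW_PAGE = """<html><body>
<div><h1>Other thread</h1><p>banner</p></div>
<div>
  <div><span>carol</span><p>x</p></div>
  <div><span>dave</span><p>y</p></div>
  <div><span>erin</span><p>z</p></div>
</div>
</body></html>"""


def indexed_path(root, target):
    """Return [(tag, position), ...] from below root down to target.

    position is 1-based among siblings that share the same tag.
    """
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        pos = next(i for i, c in enumerate(same_tag, 1) if c is node)
        steps.append((node.tag, pos))
        node = parent
    return list(reversed(steps))


def induce_template(root, labeled_nodes):
    """Generalize the labeled nodes' paths into one XPath-style template.

    Hypothetical rule: keep a positional index only where every labeled
    node agrees on it; otherwise match any sibling with that tag.
    """
    paths = [indexed_path(root, n) for n in labeled_nodes]
    tags = [tag for tag, _ in paths[0]]
    if any([tag for tag, _ in p] != tags for p in paths):
        raise ValueError("labeled nodes do not share a tag path")
    parts = []
    for i, tag in enumerate(tags):
        positions = {p[i][1] for p in paths}
        parts.append(f"{tag}[{positions.pop()}]" if len(positions) == 1 else tag)
    return "./" + "/".join(parts)


# Template generation: the labeled nodes would come from a classifier
# trained on visual features; here every <p> is labeled (an assumption).
train_root = ET.fromstring(TRAIN_PAGE)
labeled = list(train_root.iter("p"))
template = induce_template(train_root, labeled)

# Extraction: only the template is applied; no visual features are computed.
new_root = ET.fromstring(NEW_PAGE)
extracted = [node.text for node in new_root.findall(template)]
```

In this sketch the expensive step (labeling nodes) happens once per site, while extraction on further pages is a plain path lookup. The banner paragraph on the new page is not matched because it lies outside the induced path.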
Authors
WANG Xianfa
GUO Yan
LIU Yue
YU Xiaoming
CHENG Xueqi
(School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China; CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China)
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journals (北大核心)
2019, Issue 5, pp. 103-112 (10 pages)
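Funding
National Key Research and Development Program of China (2017YFB0803302, 2016YFB1000902)
National Basic Research Program of China (973 Program) (2014CB340405)
National Basic Research Program of China (973 Program) (2014CB340401)
National Natural Science Foundation of China (61433014)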
Keywords
visual features
web information extraction
automatic template generation