
Employment Information Page Extraction Method Based on Visual Features (cited by: 2)
Abstract: With the development of network technology, a large amount of employment information has appeared on the Internet, but the data is scattered across many employment websites and presented in different ways. To address the trade-off between accuracy and efficiency in traditional Web information extraction methods, this paper adopts a template generation approach based on the visual features of web pages and proposes an employment information page extraction method built on it, which keeps extraction accurate while minimizing manual intervention. The method automatically generates an initial template by analyzing a page's visual features, then produces the final extraction template through manual configuration. With this method, scattered employment data on the Internet is converted into a unified data format and stored. Experimental results show that the proposed method achieves high precision and recall.
Source: Software (《软件》), 2014, No. 9, pp. 16-20 (5 pages)
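Funding: National Science and Technology Support Program project (2013BAH10F01), "Employment Information Service System and Application Demonstration for the Full Life Cycle of Workers"; Specialized Research Fund for the Doctoral Program of Higher Education (20110005120007); Beijing Higher Education Young Elite Teacher Project (YETP0445); Engineering Research Center of Information Networks, Ministry of Education; special funding from the Beijing Municipal Education Commission joint construction project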
Keywords: Web information extraction; template; VIPS; DOM tree; XPath
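
The template-driven extraction step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the VIPS-based visual analysis that would produce the initial template is not shown, and the template fields, XPath expressions, and sample page below are all hypothetical. The sketch only shows how a finished template of XPath expressions might map one page's DOM tree to a record in a unified format, using the limited XPath subset of Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction template: field name -> XPath (ElementTree subset).
# In the described method, an initial template would come from visual-feature
# analysis and then be adjusted manually; here it is simply written by hand.
TEMPLATE = {
    "title": ".//div[@class='job']/h1",
    "company": ".//div[@class='job']/span[@class='company']",
    "location": ".//div[@class='job']/span[@class='city']",
}

# Made-up, well-formed sample page. Real-world HTML is often not well-formed
# XML and would need a tolerant HTML parser instead.
PAGE = """
<html><body>
  <div class="nav">Home | Jobs</div>
  <div class="job">
    <h1>Software Engineer</h1>
    <span class="company">Example Co.</span>
    <span class="city">Beijing</span>
  </div>
</body></html>
"""


def extract(page, template):
    """Apply each XPath in the template and return one unified record."""
    root = ET.fromstring(page.strip())
    record = {}
    for field, xpath in template.items():
        node = root.find(xpath)
        # Missing fields become None so every record has the same keys.
        if node is not None and node.text:
            record[field] = node.text.strip()
        else:
            record[field] = None
    return record


record = extract(PAGE, TEMPLATE)
print(record)
```

Because every page from the same site is passed through the same template, the resulting records share one set of keys regardless of how the source page lays the fields out, which is the sense in which the data ends up in a unified format.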
