Abstract
With the development of network technology, a large amount of employment information has appeared on the Internet, but it is scattered across many job websites and presented in different ways. To resolve the conflict between accuracy and efficiency in traditional Web information extraction methods, this paper adopts template generation based on the visual features of web pages and proposes an extraction method for employment information pages that preserves extraction accuracy while keeping manual intervention to a minimum. The method automatically generates an initial template by analyzing the visual features of a page, and manual configuration then turns it into the final extraction template. With this method, scattered employment data on the Internet is converted into a unified format and stored. Experimental results show that the proposed method achieves high precision and recall.
Source
《软件》 (Software)
2014, Issue 9, pp. 16-20 (5 pages)
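Funding
National Key Technology R&D Program of China (2013BAH10F01), project "Employment Information Service System Covering the Full Life Cycle of Workers and Its Application Demonstration"
Specialized Research Fund for the Doctoral Program of Higher Education (20110005120007)
Beijing Higher Education Young Elite Teacher Project (YETP0445)
Engineering Research Center of Information Networks, Ministry of Education
Special funding from the Joint Construction Project of the Beijing Municipal Commission of Education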