Abstract
On many large online job-hunting platforms, recruitment information is inconsistent, making it difficult for job seekers to find accurate, relevant postings among rich but redundant listings. Using web crawler technology, this paper designs and implements continuous crawling and analytical display of job data from recruitment websites on a distributed Docker container cluster. First, the Swarm container management tool is used to build a Docker cluster spanning multiple physical hosts. Then, Python's Scrapy framework is used to run a continuous distributed crawler over the unstructured job postings of mainstream recruitment websites, covering URL deduplication, data collection, extraction, and cleaning, and producing a MySQL database of job postings. Finally, the job database is analyzed and mined to generate a heat map of job-count distribution, job-skill profiles, and statistical charts for visual display, providing job seekers with an intuitive reference for position information.
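The abstract mentions URL address deduplication as one step of the continuous distributed crawl. A minimal sketch of a fingerprint-based duplicate filter is shown below; the class name and method names are illustrative, not from the paper, and Scrapy's built-in `RFPDupeFilter` applies the same idea (hashing each request to a fingerprint and keeping a set of seen fingerprints).

```python
import hashlib


class UrlDupeFilter:
    """Minimal URL de-duplication filter: hash each URL to a
    fixed-size fingerprint and keep a set of fingerprints seen so far.
    Illustrative sketch only; Scrapy's RFPDupeFilter works similarly."""

    def __init__(self):
        self.seen = set()  # SHA-1 fingerprints of URLs already crawled

    def is_duplicate(self, url: str) -> bool:
        """Return True if this URL was seen before; otherwise record it."""
        fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


f = UrlDupeFilter()
f.is_duplicate("https://example.com/job/1")   # first visit: False
f.is_duplicate("https://example.com/job/1")   # repeat visit: True
```

In a distributed setting such as the Docker Swarm cluster described here, the in-memory set would typically be replaced by a shared store (e.g. a Redis set) so that all crawler containers consult the same fingerprint collection.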
Authors
ZHANG Liang-bin, CHAI Hui, WANG Yuan-ming, WAN Jian
(Zhejiang Wanli University, Ningbo, Zhejiang 315100; Hangzhou Dianzi University, Hangzhou, Zhejiang 310018)
Source
Journal of Zhejiang Wanli University, 2019, No. 2, pp. 85-90 (6 pages)
Funding
2018 University Visiting Scholar "Teacher Professional Development" Program (FX2018050)
2018 Zhejiang Provincial College Students' Science and Technology Innovation Program (Xinmiao Talent Program) (2018R420016)