Abstract
To integrate information from industry websites for applications such as ranking reference, public opinion monitoring, and hotspot extraction, a General Platform for Integrating Information in Industry Website (GPIIIW) was developed. The platform comprises three software systems: an industry website crawler system, an industry website data processing system, and an industry website data display system. Based on an analysis of the functional requirements of GPIIIW, the overall technical architecture of the platform and the design of the three software systems are presented. The principles behind the key techniques are also given, including incremental crawling of webpages, two-class (dichotomy) extraction of information webpages, and prediction of webpage titles. The platform has been deployed for the logistics industry of China and for the united front system of Hunan, forming an industry website information platform for each. The logistics industry platform has integrated 10 websites and crawled 313,199 webpages; the Hunan united front platform has integrated 26 websites and crawled 64,216 webpages.
Author
DENG Ziyun (邓子云) (Changsha Commerce & Tourism College, Changsha 410116, China)
Source
Industrial Technology and Vocational Education (《工业技术与职业教育》)
2022, No. 2, pp. 10-14 (5 pages)
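Funding
Natural Science Foundation of Hunan Province project "Development and Application of an Intelligent Engine for Filtering Massive Webpages That Supports Combining Multiple Filtering Methods" (Grant No. 2020JJ7091), principal investigator DENG Ziyun.
National Natural Science Foundation of China, Youth Fund project "Small-Sample-Driven Deep Detection Method for Network Attacks on Wind Power Monitoring Systems" (Grant No. 62103143), principal investigator CHEN Lei.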
Keywords
industry website
Scrapy crawler
integrated platform
webpage classification
title extraction