Abstract
The World Wide Web holds countless pages carrying a vast and varied body of information. For data analysis or product requirements, we often need to extract the interesting and valuable content from particular websites. Even in the 21st century, nobody can open every page by hand and then copy and paste its contents, so we need a program that automatically retrieves web pages and extracts content according to specified rules: a crawler. In theory, a crawler can be written in any language that supports network communication. The choice of language matters little to the crawler itself, but some languages are simpler and more convenient than others, and most crawlers today are written in server-side scripting languages. This article introduces the basic working principles of crawlers and how they can be applied on a cloud media platform.
Authors
沈洁
李一凡
SHEN Jie; LI Yifan (JSCN Baron New Media Technology Co., Ltd., Nanjing 210001, China)
Source
《中国有线电视》
2018, No. 5, pp. 595-597 (3 pages)
China Digital Cable TV
Keywords
radar
television
management