Abstract
Open-source science and technology information published on foreign websites has become an important form and an integral part of scientific and technical intelligence. Using vertical crawling technology to extract, integrate, parse, track, and study this web content helps researchers gain a real-time, comprehensive, and in-depth understanding of the state of research in their fields. However, some foreign websites are currently difficult to access from within China, and many foreign websites have strengthened their anti-crawling strategies and their deployment. Crawling techniques are repeatedly overtaken by anti-crawling techniques, which makes acquiring information on a specific topic at scale particularly difficult. Such information is hard to obtain through simple search, and some of it is highly time-sensitive; tracking it manually is difficult and time-consuming and does not support long-term data accumulation. We therefore focus on the anti-crawling techniques that affect open-source information acquisition. Based on an analysis of typical foreign scientific websites, we propose targeted solutions and give a systematic introduction to anti-crawling and crawling technologies and their application.
Authors
张晔
孙光光
徐洪云
庞婷
曲潇洋
ZHANG Ye; SUN Guangguang; XU Hongyun; PANG Ting; QU Xiaoyang (Northern Science and Technology Information Institute, Beijing 100089, China)
Source
《竞争情报》
2020, No. 1, pp. 24-28 (5 pages)
Competitive Intelligence