Abstract
With the rapid growth of information on the Internet, distributed web crawler systems have become an important development trend. Starting from a set of initial seed URLs, this paper uses a distributed web crawler system to collect audio at large scale. It also investigates several techniques used in the distributed crawler: resolving the real address of media audio, URL deduplication, distributed task scheduling, and sniffer-based detection. Experimental results show that the distributed audio crawler can accurately collect large volumes of audio that meet the requirements at a relatively low time cost.
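As an illustration of the URL deduplication step mentioned in the abstract, the following is a minimal sketch only. The abstract does not describe how the paper implements deduplication, so everything here is an assumption: the names `normalize_url` and `UrlDeduplicator` are hypothetical, and the approach (canonicalize each URL, then keep SHA-1 fingerprints in an in-memory set) is one common technique rather than the authors' method. A distributed crawler would more likely keep the set in a shared store.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different spellings compare equal.

    Hypothetical helper, not taken from the paper.
    """
    parts = urlsplit(url.strip())
    # Lowercase scheme and host (assumes the URL carries no
    # case-sensitive userinfo such as "User:Pass@host").
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # An empty path and "/" refer to the same resource.
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class UrlDeduplicator:
    """In-memory set of URL fingerprints (assumed design, single process)."""

    def __init__(self) -> None:
        self._seen = set()

    def add_if_new(self, url: str) -> bool:
        """Return True if the URL was not seen before, and record it."""
        fingerprint = hashlib.sha1(
            normalize_url(url).encode("utf-8")
        ).digest()
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True
```

A crawler worker would call `add_if_new` before enqueueing each discovered link and skip any URL for which it returns `False`.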
Source
Journal of Hanshan Normal University (《韩山师范学院学报》)
2015, No. 6, pp. 28-34 (7 pages)
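Funding
Natural Science Foundation of Guangdong Province (Grant No. 2014A030310038)
Scientific Research Project of the Department of Education of Guangdong Province (Grant No. 2013KJCX0127)
Guangdong Province 2013 Higher Education Teaching Reform Project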