Abstract
Driven by the boom in the online video and live-streaming industries, danmaku (bullet-comment) services have grown rapidly in recent years. The danmaku environment on mainstream services, however, has long lacked oversight: rule violations by streamers and viewers persist despite repeated bans, which has brought the industry negative publicity, and academic research on live-streaming danmaku is scarce, so a scheme for collecting and processing danmaku is urgently needed. Based on the technical characteristics of well-known Chinese danmaku services, a distributed live-streaming danmaku crawler system is designed, and the corresponding mechanisms for establishing room connections and collecting danmaku are analyzed and proposed. Services with open APIs are handled directly by lightweight clients. For services that are built on Adobe Flash and do not expose an API, Electron, which is based on the Chromium browser, is used to simulate browsing of the live-room web pages, and its PPAPI plugin interface implementation is modified to side-channel (tap) the Flash network traffic for capture. Validation experiments on a well-known danmaku platform show that the system can schedule IP address resources for relatively large-scale crawling and performs well, handling danmaku traffic that averages 134 messages per second with peaks above 1,000 messages per second.
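The reported figures (an average of 134 and a peak above 1,000 messages per second) are throughput statistics over a stream of captured messages. The abstract does not say how they were measured; the following is only a minimal sketch of one way such numbers could be derived, assuming each captured message carries an arrival timestamp in seconds and that the peak is taken over one-second buckets. The function name `throughput_stats` is hypothetical and not taken from the paper.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def throughput_stats(timestamps: Iterable[float]) -> Tuple[float, int]:
    """Return (average, peak) messages per second.

    Assumption: each timestamp is the arrival time of one danmaku
    message, in seconds. Messages are grouped into one-second buckets.
    The average is taken over the whole observed span, so seconds in
    which nothing arrived still count toward the denominator.
    """
    buckets = Counter(math.floor(t) for t in timestamps)
    if not buckets:
        return 0.0, 0
    span_seconds = max(buckets) - min(buckets) + 1
    total_messages = sum(buckets.values())
    return total_messages / span_seconds, max(buckets.values())


if __name__ == "__main__":
    # Six messages spread over a four-second window (seconds 0 to 3).
    sample = [0.1, 0.5, 0.9, 1.2, 3.7, 3.8]
    average, peak = throughput_stats(sample)
    print(f"average {average:.2f}/s, peak {peak}/s")
```

Counting idle seconds in the average is a design choice; averaging only over seconds that contain traffic would give a higher figure, so the two conventions should not be mixed when comparing systems.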
Source
Computer Applications and Software (《计算机应用与软件》)
PKU Core Journal (北大核心)
2018, Issue 2, pp. 134-140 (7 pages)
Funding
National Key Technology R&D Program of China (2015BAH54F00)
National Key R&D Program of China (2016YFB0800305)
Keywords
Live-streaming danmaku
Crawler
Browser simulation
PPAPI side-channeling