A Distributed Data-Crawling Technology for Microblog API (基于微博API的分布式抓取技术) Cited by: 7
Abstract With the rapid growth of microblog users, more and more people want to mine interesting patterns from user behavior and microblog content. To collect microblog data effectively and within reasonable limits, a distributed crawling technique based on the microblog API is proposed. It simulates microblog login to obtain authorization automatically, keeps the API call frequency within reasonable bounds, and uses a task allocation controller to fetch microblog data efficiently. The technique also combines time triggering with an in-memory database to implement duplicate control, which avoids repeated crawling and repeated storage of data and improves system performance. Crawling tasks can be assigned to distributed clients independently. The technique offers high scalability, clear task allocation, high efficiency, and multiple crawling strategies that suit different crawling needs. A Sina microblog crawling example verifies its feasibility.
Source Telecommunications Science (《电信科学》), Peking University Core Journal (北大核心), 2013, No. 8, pp. 146-150, 155 (6 pages)
Keywords Sina microblog, crawling strategy, distributed crawling, microblog API
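
The abstract names three mechanisms: API call-frequency control, task allocation to distributed clients, and duplicate control built on time triggering and an in-memory database. This record does not show the paper's implementation, so the following is only a minimal sketch of how such pieces might look. All names (`RateLimiter`, `SeenStore`, `assign_tasks`) are hypothetical, the sliding-window limit and round-robin assignment are assumed choices, and a plain Python dict with expiry times stands in for the in-memory database.

```python
import time
from collections import deque


class RateLimiter:
    """Sliding-window limiter (assumed policy): at most max_calls per window."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window = window_seconds
        self.clock = clock
        self.calls = deque()  # timestamps of recent API calls

    def try_acquire(self):
        now = self.clock()
        # Forget calls that have left the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False  # caller should wait before calling the API


class SeenStore:
    """In-memory duplicate filter; entries expire after ttl_seconds.

    The expiry is one possible reading of "time triggering": an item may be
    crawled again only once its previous record has aged out.
    """

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.expiry = {}  # item id -> time at which the record expires

    def add_if_new(self, item_id):
        now = self.clock()
        expires = self.expiry.get(item_id)
        if expires is not None and expires > now:
            return False  # still fresh: skip crawling and storing it again
        self.expiry[item_id] = now + self.ttl
        return True


def assign_tasks(user_ids, num_clients):
    """Round-robin split of crawl targets across clients (assumed strategy)."""
    buckets = [[] for _ in range(num_clients)]
    for i, uid in enumerate(user_ids):
        buckets[i % num_clients].append(uid)
    return buckets
```

In this sketch a client would call `try_acquire()` before each API request and `add_if_new()` before storing each fetched item. Passing a `clock` function makes both classes easy to test without real waiting.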