Design and Implementation of Crowdsourcing-based Social Network Data Collection Model (cited by: 14)
Abstract: Social network data is information-rich and strongly topical, holds great value for data mining, and is an important part of Internet big data. Because traditional search engines cannot use keyword retrieval to index the content of social network platforms directly, this paper designs and implements a data collection model based on the crowdsourcing mode and a C/S architecture. The model consists of four modules: a server, a client, a storage system, and a topic Deep Web crawler system. Distributed machine nodes running the topic Deep Web crawler automatically request crawl tasks from the server and upload the crawled data, while the Hadoop Distributed File System (HDFS) is used to process the crawled data rapidly and store the results. Experimental results show that the topic Deep Web crawler system is simple to configure, supports functional extension, and acquires target information directly, and that the data collection model achieves fast data acquisition, a high task success rate, and efficient information retrieval.
Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2015, Issue 4, pp. 36-40 (5 pages)
Funding: National "863" Program project "Mass Information Consumption Service Platform and Application Demonstration Based on Media Big Data" (SS2014AA012305)
Keywords: social network; crowdsourcing mode; distributed computing; information collection; Web crawler; Hadoop Distributed File System (HDFS)
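
The abstract describes crawler nodes that automatically request tasks from the server and upload what they crawl. The sketch below illustrates only that request/upload loop, entirely in memory. It is not the paper's implementation: `TaskServer`, `run_node`, and `fake_fetch` are hypothetical names, a plain dictionary stands in for the HDFS-backed storage system, and no network or crawling takes place.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TaskServer:
    """Hypothetical stand-in for the server module of the model."""
    pending: deque = field(default_factory=deque)
    # A dict stands in for the storage system (HDFS in the paper).
    results: dict = field(default_factory=dict)

    def request_task(self):
        """Hand out the next crawl task, or None when the queue is empty."""
        return self.pending.popleft() if self.pending else None

    def upload(self, task, data):
        """Store the data a node collected for a task."""
        self.results[task] = data


def run_node(server, fetch):
    """Client loop: request tasks until none remain, uploading each result.

    Returns the number of tasks this node completed.
    """
    done = 0
    while (task := server.request_task()) is not None:
        server.upload(task, fetch(task))
        done += 1
    return done


def fake_fetch(task):
    # Placeholder for the topic Deep Web crawler; performs no network access.
    return f"page content for {task}"


server = TaskServer(pending=deque(["user/1", "user/2", "user/3"]))
completed = run_node(server, fake_fetch)
print(completed, sorted(server.results))
```

Having nodes pull tasks instead of the server pushing them means a volunteer machine can join or leave without the server tracking it, which fits the crowdsourcing setting the abstract describes.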