Chinese microblog data collecting method based on multiple hybrid strategies (cited by: 6)
Abstract: Building on a comparative analysis of Cookie-based crawling and API-based microblog data collection, this paper proposes a Chinese microblog data collection method that fuses multiple strategies. A breadth-first collection algorithm and a random-active-user collection algorithm are designed and implemented to collect, comprehensively and efficiently, four kinds of data from Chinese microblogs: user IDs, user profile information, users' microblog posts, and users' follow relationships. Together these provide a valuable information source for microblog social network analysis. Experiments on a real data set show that the method achieves both high collection efficiency and good user coverage.
Source: Computer Engineering and Design (《计算机工程与设计》), CSCD, Peking University Core Journal (北大核心), 2013, No. 11, pp. 3835-3839 (5 pages)
Keywords: Chinese microblog; data collection; search engine; Cookie-based crawler; information mining
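
The abstract names a breadth-first collection algorithm but does not give its details. The following is only a minimal sketch of a generic breadth-first traversal over a follow graph, not the authors' implementation. The function `bfs_collect`, the callback `get_followees` (standing in for either a Cookie-based crawler or an API call), the `max_users` cap, and the toy graph are all hypothetical names introduced here for illustration.

```python
from collections import deque
from typing import Callable, Iterable, List


def bfs_collect(
    seed_ids: Iterable[str],
    get_followees: Callable[[str], Iterable[str]],
    max_users: int,
) -> List[str]:
    """Return up to max_users user IDs, visited breadth-first from the seeds.

    get_followees is a hypothetical callback: given a user ID, it returns the
    IDs that user follows, however they were fetched (crawler or API).
    """
    queue = deque()
    seen = set()
    for uid in seed_ids:
        if uid not in seen:
            seen.add(uid)
            queue.append(uid)

    collected: List[str] = []
    while queue and len(collected) < max_users:
        uid = queue.popleft()          # oldest entry first: breadth-first order
        collected.append(uid)
        for followee in get_followees(uid):
            if followee not in seen:   # never enqueue the same user twice
                seen.add(followee)
                queue.append(followee)
    return collected


# Toy follow graph with made-up IDs, standing in for real responses.
TOY_GRAPH = {
    "u1": ["u2", "u3"],
    "u2": ["u4"],
    "u3": ["u4", "u5"],
    "u4": [],
    "u5": ["u1"],
}


def toy_followees(uid: str) -> List[str]:
    return TOY_GRAPH.get(uid, [])


if __name__ == "__main__":
    print(bfs_collect(["u1"], toy_followees, max_users=4))
```

Passing the fetch step in as a callback keeps the traversal independent of how the follow list is obtained, so the same loop could sit on top of either collection strategy the abstract compares.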