
Research on Data Crawling and Storage System of Agricultural Product Price Based on Hadoop Platform

Cited by: 4
Abstract: Many large agricultural markets and agricultural e-commerce information platforms now publish daily price data for agricultural products across regions in real time. Because the data is updated quickly, is large in volume, and comes in varied formats, crawling, storing, and subsequently analyzing it is difficult. This paper proposes a Hadoop-based system for crawling and storing agricultural product prices. The crawler uses the HttpClient framework with a thread pool to fetch pages on multiple threads. After a crawl finishes, an integrity check filters out web pages with incomplete information, and those pages are crawled again until the information is complete. The crawled pages are parsed and cleaned with regular expressions to extract the useful data, which is stored as text files in HDFS (Hadoop Distributed File System); data from later crawls is appended to the HDFS files. Experiments show that HDFS write performance keeps up with the steadily growing volume of crawled data, and that write performance improves as the replica count decreases and the block size increases.
Authors: Yang Xiaodong (杨晓东), Gao Lutao (郜鲁涛), Yang Linnan (杨林楠), Liu Jianyang (刘建阳) (College of Basic Science and Information Engineering, Yunnan Agriculture University, Kunming 650201, Yunnan, China; Yunnan Information Technology Development Center, Kunming 650228, Yunnan, China)
Source: Computer Applications and Software (《计算机应用与软件》), 2017, No. 3, pp. 76-80 (5 pages)
Funding: National Science and Technology Support Program of China during the 12th Five-Year Plan (Grant No. 2014BAD10B03)
Keywords: Distributed system; Crawler; Hadoop; HDFS; Regular expression
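
The pipeline outlined in the abstract (thread-pool crawling, an integrity check with re-crawling, regular-expression parsing, and append-only storage) can be illustrated with a minimal sketch. This is not the authors' implementation: it is written in Python rather than with HttpClient, it appends to a local text file as a stand-in for an HDFS append, and the page layout, the `<!--end-->` completeness marker, the `max_rounds` cap on re-crawling, and all function names are assumptions made for illustration. The `fetch` argument is a caller-supplied function that returns the page text for a URL, or an empty string on failure.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple

# Assumed (hypothetical) page layout: one table row per price record,
# and a trailing marker that is present only when the page loaded fully.
ROW_RE = re.compile(
    r"<tr><td>([^<]+)</td><td>([^<]+)</td><td>(\d+(?:\.\d+)?)</td></tr>"
)
END_MARKER = "<!--end-->"

Record = Tuple[str, str, float]  # (product, market, price)


def is_complete(html: str) -> bool:
    """Integrity check: page has the end marker and at least one price row."""
    return END_MARKER in html and ROW_RE.search(html) is not None


def crawl(urls: Iterable[str], fetch: Callable[[str], str],
          workers: int = 4, max_rounds: int = 3) -> Dict[str, str]:
    """Fetch pages on a thread pool and re-crawl the incomplete ones.

    max_rounds is an assumed safety cap; the abstract only says that
    pages are re-crawled until the information is complete.
    """
    pages: Dict[str, str] = {}
    pending: List[str] = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_rounds):
            if not pending:
                break
            results = list(pool.map(fetch, pending))
            retry: List[str] = []
            for url, html in zip(pending, results):
                if is_complete(html):
                    pages[url] = html
                else:
                    retry.append(url)  # schedule a second crawl
            pending = retry
    return pages


def parse(html: str) -> List[Record]:
    """Extract and clean (product, market, price) records with a regex."""
    return [(name.strip(), market.strip(), float(price))
            for name, market, price in ROW_RE.findall(html)]


def append_records(path: str, records: Iterable[Record]) -> None:
    """Append records as tab-separated text (local stand-in for HDFS append)."""
    with open(path, "a", encoding="utf-8") as out:
        for name, market, price in records:
            out.write(f"{name}\t{market}\t{price}\n")
```

A caller would pass its own `fetch` function (for example, a thin wrapper around an HTTP client), then call `parse` on each returned page and `append_records` on the result. Later crawls write to the same path, so the file grows by appending rather than being rewritten.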
