At present, big data is very popular because it has proved highly successful in many fields such as social media and e-commerce transactions. Big data describes the tools and technologies needed to capture, manage, store, distribute, and analyze petabyte-scale or larger datasets of varying structure at high speed. Big data can be structured, unstructured, or semi-structured. Hadoop is an open-source framework used to process large amounts of data inexpensively and efficiently, and job scheduling is a key factor for achieving high performance in big data processing. This paper gives an overview of big data and highlights its problems and challenges. It then describes the Hadoop Distributed File System (HDFS), Hadoop MapReduce, and the components that affect the performance of job scheduling in big data, such as the JobTracker, TaskTracker, NameNode, and DataNode. The primary purpose of this paper is to present a comparative study of job scheduling algorithms along with their experimental results in a Hadoop environment. In addition, the paper describes the features, advantages, and drawbacks of various Hadoop job schedulers, such as FIFO, Fair, Capacity, Deadline Constraints, Delay, LATE, and Resource Aware, and provides a comparative study among these schedulers.
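For context on how such pluggable schedulers are swapped in practice, here is a minimal sketch that sets the YARN scheduler class on a Hadoop Configuration object. The property key and scheduler class names are the standard YARN ones; on classic MRv1 clusters (JobTracker/TaskTracker) a different key, mapred.jobtracker.taskScheduler, plays this role. The standalone class is illustrative only; in a real cluster these keys normally live in yarn-site.xml.

```java
import org.apache.hadoop.conf.Configuration;

// Minimal sketch: choosing among the pluggable YARN schedulers that
// surveys like this one compare.
public class SchedulerConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // FIFO scheduler: jobs run strictly in submission order.
        conf.set("yarn.resourcemanager.scheduler.class",
                 "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler");

        // Fair scheduler: resources are shared so every job gets, on
        // average, an equal share over time.
        // conf.set("yarn.resourcemanager.scheduler.class",
        //          "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");

        // Capacity scheduler: queues with guaranteed capacities, intended
        // for multi-tenant clusters.
        // conf.set("yarn.resourcemanager.scheduler.class",
        //          "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");

        System.out.println(conf.get("yarn.resourcemanager.scheduler.class"));
    }
}
```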
We have witnessed the fast-growing deployment of Hadoop, an open-source implementation of the MapReduce programming model, for data-intensive computing in the cloud. However, Hadoop was not originally designed to run transient jobs in which users need to move data back and forth between storage and computing facilities. As a result, Hadoop is inefficient and wastes resources when operating in the cloud. This paper discusses the inefficiency of MapReduce in the cloud, studies its causes, and proposes a solution. The inefficiency arises mainly during data movement: transferring large datasets to computing nodes is very time-consuming and also violates the rationale of Hadoop, which is to move computation to the data. To address this issue, we developed a distributed cache system and a virtual machine scheduler. We show that our prototype can improve performance significantly when running different applications.
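As a hedged illustration of the caching idea this paper motivates (not the authors' actual system), the sketch below checks a node-local cache directory before fetching an input block from remote cloud storage, so repeated transient jobs avoid re-transferring the same data. The names CACHE_DIR and fetchFromRemoteStore are hypothetical.

```java
import java.io.IOException;
import java.nio.file.*;

// Hypothetical sketch: consult a node-local block cache before paying the
// cost of moving data from remote cloud storage to the compute node.
public class BlockCacheSketch {
    private static final Path CACHE_DIR = Paths.get("/tmp/block-cache");

    public static byte[] readBlock(String blockId) throws IOException {
        Path cached = CACHE_DIR.resolve(blockId);
        if (Files.exists(cached)) {
            return Files.readAllBytes(cached);       // cache hit: no network transfer
        }
        byte[] data = fetchFromRemoteStore(blockId); // cache miss: expensive data movement
        Files.createDirectories(CACHE_DIR);
        Files.write(cached, data);                   // keep the block for later jobs
        return data;
    }

    // Stand-in for a transfer from cloud object storage.
    private static byte[] fetchFromRemoteStore(String blockId) throws IOException {
        return new byte[0]; // placeholder
    }
}
```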
Existing recommendation systems for big data suffer from two shortcomings: poor scalability of data storage and poor extensibility of the recommendation algorithm. After studying the item-based collaborative filtering (IBCF) algorithm and the working principles of the Hadoop and HBase platforms, this paper proposes an optimized design for a personalized recommendation system based on Hadoop and HBase. The experimental results show that the HBase database effectively solves the problem of mass data storage, and that processing the recommendation problem in parallel with Hadoop's MapReduce programming model significantly improves the efficiency of the algorithm, further improving the performance of the personalized recommendation system.
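To make the storage side concrete, here is a minimal sketch, using the standard HBase client API, of how user-item ratings could be stored and read back for IBCF; the table name "ratings" and column family "r" are illustrative assumptions, not the paper's schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: user-item ratings in an HBase table keyed by user, one column per
// item. IBCF compares item columns across such rows to compute item-item
// similarity.
public class RatingStoreSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("ratings"))) {

            // Store: user "u42" rated item "i7" with 4.5.
            Put put = new Put(Bytes.toBytes("u42"));
            put.addColumn(Bytes.toBytes("r"), Bytes.toBytes("i7"), Bytes.toBytes("4.5"));
            table.put(put);

            // Load the whole rating row for one user.
            Result row = table.get(new Get(Bytes.toBytes("u42")));
            String rating = Bytes.toString(
                    row.getValue(Bytes.toBytes("r"), Bytes.toBytes("i7")));
            System.out.println("u42 rated i7: " + rating);
        }
    }
}
```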
Today, customers' requirements have been entirely transformed. Many big retail organizations are facing a sudden decline in sales and revenues caused by the indecisive and erratic purchasing habits of the recent generation of users: with abundant information such as cheaper rates, attractive offers, discounts, and comparisons of similar products available on their smartphones or laptops, they place orders straightaway instead of walking into a showroom. As a result, large companies such as Tesco, Wal-Mart, and Target have realized that they must partner with startup firms that already provide platforms to retain customers, either through deep exploration of transactional data or by offering lucrative deals that benefit customers and promote the market basket. The big data generated from consumer purchase patterns is a concern for companies, so various big retail organizations are applying advanced and scalable data mining algorithms to store and evaluate data precisely and in real time to boost market basket analysis. This research work discusses various improved association rule mining (ARM) algorithms. The objective of this study is to identify gaps that provide opportunities for new research and to recognize the expansion of big data analytics in the retail environment and its future directions. This paper examines various aspects of parallel ARM algorithms for market basket analysis against their sequential and distributed counterparts, which are further scaled to the Hadoop and MapReduce computing platforms. Furthermore, various use cases highlighting the need for 'Big Data Retail Analytics' are discussed with respect to emerging trends: promoting sales and revenues, monitoring competitors' websites, comparing various brands, and enticing new customers.
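As one concrete building block of parallel ARM on Hadoop, the sketch below counts candidate item pairs across transactions in word-count style using the standard Hadoop MapReduce API. The comma-separated transaction format is an assumption, and the driver class, input format, and minimum-support pruning of a full Apriori pass are omitted.

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: compute the support count of every item pair across transactions.
// Each input line is assumed to be one transaction, e.g. "bread,milk,eggs".
public class PairCountSketch {

    public static class PairMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] items = value.toString().split(",");
            Arrays.sort(items); // canonical order so {a,b} == {b,a}
            for (int i = 0; i < items.length; i++) {
                for (int j = i + 1; j < items.length; j++) {
                    ctx.write(new Text(items[i] + "|" + items[j]), ONE);
                }
            }
        }
    }

    public static class PairReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text pair, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(pair, new IntWritable(sum)); // support count of the pair
        }
    }
}
```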
Current popular systems, Hadoop and Spark, cannot achieve satisfactory performance when running iterative big data applications because of inefficient overlapping of computation and communication. The pipeline of computing, data movement, and data management plays a key role in current distributed data computing systems. In this paper, we first analyze the overhead of the shuffle operation in Hadoop and Spark when running a PageRank workload, and then propose an event-driven pipeline and in-memory shuffle design with better overlapping of computation and communication, implemented as DataMPI-Iteration, an MPI-based library for iterative big data computing. Our performance evaluation shows that DataMPI-Iteration achieves a 9X-21X speedup over Apache Hadoop and a 2X-3X speedup over Apache Spark for PageRank and K-means.
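The core idea, overlapping communication with computation instead of running them in separate phases, can be sketched generically (this is not the DataMPI-Iteration API): one thread computes shuffle partitions and hands each one off through a queue to a second thread that transmits it while the next partition is being computed.

```java
import java.util.concurrent.*;

// Hypothetical sketch of computation/communication overlap: the "send"
// stage runs concurrently with the compute stage, so transfer time hides
// behind computation rather than following it.
public class OverlapPipelineSketch {
    public static void main(String[] args) throws Exception {
        BlockingQueue<int[]> ready = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(2);

        pool.submit(() -> {               // compute stage
            for (int p = 0; p < 8; p++) {
                ready.add(computePartition(p)); // hand off as soon as ready
            }
            ready.add(new int[0]);        // empty array = end-of-stream marker
        });

        pool.submit(() -> {               // communication stage, overlapped
            try {
                int[] part;
                while ((part = ready.take()).length > 0) {
                    send(part);           // runs while the next partition is computed
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    private static int[] computePartition(int p) { return new int[]{p}; }
    private static void send(int[] partition) { /* stand-in for an MPI send */ }
}
```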
Light detection and ranging (LiDAR) data are essential for scientific discoveries in Earth and ecological sciences, environmental applications, and responses to natural disasters. While collecting LiDAR data over large areas is now quite feasible, the subsequent processing steps typically involve large computational demands. Efficiently storing, managing, and processing LiDAR data are prerequisites for enabling these LiDAR-based applications. However, handling LiDAR data poses grand geoprocessing challenges due to its data and computational intensity. To tackle these challenges, we developed a general-purpose scalable framework coupled with a sophisticated data decomposition and parallelization strategy to efficiently handle 'big' LiDAR data collections. The contributions of this research are (1) a tile-based spatial index to manage big LiDAR data in the scalable and fault-tolerant Hadoop Distributed File System, (2) two spatial decomposition techniques that enable efficient parallelization of different types of LiDAR processing tasks, and (3) the coupling of existing LiDAR processing tools with Hadoop, so that a variety of LiDAR data processing tasks can be conducted in parallel in a highly scalable distributed computing environment through an online geoprocessing application. A proof-of-concept prototype is presented to demonstrate the feasibility, performance, and scalability of the proposed framework.
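A minimal sketch of the tile-based indexing idea, assuming planar coordinates and an illustrative 500 m tile size: each point is mapped to a grid-tile key, so points in the same tile can be grouped into the same HDFS file and processed independently of other tiles.

```java
// Sketch: map each LiDAR point to a fixed-size grid tile. The tile size and
// key format are assumptions for illustration, not the paper's index layout.
public class TileIndexSketch {
    static final double TILE_SIZE = 500.0; // metres per tile edge, illustrative

    // Tile key such as "t_246_1308", derived from planar coordinates.
    static String tileKey(double x, double y) {
        long col = (long) Math.floor(x / TILE_SIZE);
        long row = (long) Math.floor(y / TILE_SIZE);
        return "t_" + col + "_" + row;
    }

    public static void main(String[] args) {
        // Two nearby points fall in the same tile and are grouped together.
        System.out.println(tileKey(123456.7, 654321.0)); // t_246_1308
        System.out.println(tileKey(123400.0, 654300.0)); // t_246_1308
    }
}
```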
Funding: This study was funded by the University of South Carolina through the ASPIRE (Advanced Support for Innovative Research Excellence) program [13540-16-41796]. Additional funding was provided by the South Carolina Department of Transportation under contract to the University of South Carolina [SPR #707 or USC 13540FB11], USGS [G15AC00085], and NSF-BCS [1455349].