Abstract: Data volumes and internet usage are growing rapidly, which creates problems in the management of big data. To address these problems, many software frameworks are used to increase the performance of distributed systems and to provide availability of large-scale data storage. One of the most beneficial software frameworks for utilizing data in distributed systems is Hadoop. This paper introduces the Apache Hadoop architecture and the components of Hadoop, and explains their significance in managing vast volumes of data in a distributed system. The Hadoop Distributed File System enables the storage of enormous chunks of data over a distributed network. The Hadoop framework maintains the fsImage and edits files, which support the availability and integrity of data. The paper also includes cases of Hadoop implementation, such as weather monitoring and bioinformatics processing.
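As a minimal sketch of the MapReduce model that Hadoop executes over data stored in HDFS, the following pure-Python mapper/reducer pair mimics the Hadoop Streaming word-count pattern. It runs in a single process here for illustration; on a real cluster the map and reduce phases would run on different nodes, and the shuffle/sort between them is what `sorted` stands in for. All names and sample data are illustrative.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (word, 1) pairs, as a Hadoop Streaming mapper would per input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(pairs):
    # Hadoop sorts map output by key before reducing; a reducer then
    # sums the counts for each distinct key.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data needs big storage", "hadoop stores big data"]
map_output = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(map_output))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

The same mapper and reducer scripts could be handed to a real cluster via Hadoop Streaming, with HDFS supplying the input splits.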
Abstract: Purpose - The purpose of this paper is to provide an overview of 6,618 publications on Apache Hadoop from 2008 to 2020, in order to offer a conclusive and comprehensive analysis for researchers in this field, as well as preliminary knowledge of Apache Hadoop for interested researchers. Design/methodology/approach - This paper employs bibliometric analysis and visual analysis to systematically study publications about Apache Hadoop in the Web of Science database, with the aid of visualization applications. Through bibliometric analysis of the collected documents, the paper analyzes the main statistical characteristics and cooperation networks; research themes, research hotspots, and future development trends are also investigated through keyword analysis. Findings - Research on Apache Hadoop remains a top priority, and how to improve the performance of Apache Hadoop in the era of big data is one of the research hotspots. Research limitations/implications - This paper makes a comprehensive bibliometric analysis of Apache Hadoop and is valuable for researchers who want to quickly grasp the hot topics in this area. Originality/value - This paper draws out the structural characteristics of the publications in this field and summarizes the research hotspots and trends of recent years, aiming to clarify the development status of the field and inspire new ideas for researchers.
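The keyword analysis described above rests on two simple counts: how often each author keyword appears (a proxy for hotspots) and how often keyword pairs co-occur in the same record (the edge weights of a keyword network). A small sketch, with invented keyword lists standing in for real Web of Science records:

```python
from collections import Counter
from itertools import combinations

# Hypothetical author-keyword lists from three records (invented for illustration).
records = [
    ["hadoop", "mapreduce", "big data"],
    ["hadoop", "hdfs", "big data"],
    ["spark", "big data", "hadoop"],
]

# Keyword frequency: a simple proxy for research hotspots.
freq = Counter(kw for kws in records for kw in kws)

# Pairwise co-occurrence counts: edge weights for a keyword network.
cooc = Counter()
for kws in records:
    for a, b in combinations(sorted(set(kws)), 2):
        cooc[(a, b)] += 1

print(freq["hadoop"])                 # 3
print(cooc[("big data", "hadoop")])   # 3
```

Tools such as VOSviewer or CiteSpace build their keyword maps from exactly this kind of co-occurrence matrix, then add clustering and layout on top.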
Abstract: Standalone systems cannot handle the giant traffic loads generated by Twitter due to memory constraints. The parallel computational environment provided by Apache Hadoop can distribute the data over different destination systems and process it there. In this paper, a four-node Hadoop cluster integrated with RHadoop, Flume, and Hive is created to analyze tweets gathered from the Twitter stream. Twitter stream data relevant to events/topics such as IPL-2015, cricket, Royal Challengers Bangalore, Kohli, and Modi is collected from May 24 to 30, 2016 using Flume. Hive is used as a data warehouse to store the streamed tweets. Twitter analytics such as the maximum number of tweets by a user, the average number of followers, and the maximum number of friends are obtained using Hive. A network graph is constructed from each user's unique screen name and mentions using R, and a timeline graph of individual users is also generated using R. In addition, the proposed solution analyzes the emotions of cricket fans by classifying their Twitter messages into appropriate emotional categories using the optimized support vector neural network (OSVNN) classification model. To attain better classification accuracy, the performance of the SVNN is enhanced using the chimp optimization algorithm (ChOA). Extracting users' emotions toward an event is beneficial for prediction, and it becomes more powerful when coupled with visualizations: a bar chart and word cloud are generated to visualize the emotion analysis results.
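The Hive aggregations the abstract names (maximum tweets by a user, average follower count) are ordinary GROUP BY / AVG queries. A small Python sketch of the same computations, with an invented sample standing in for the streamed tweet table:

```python
from collections import Counter

# Invented sample of streamed tweet records (user, follower count).
tweets = [
    {"user": "fan_a", "followers": 120},
    {"user": "fan_b", "followers": 300},
    {"user": "fan_a", "followers": 120},
    {"user": "fan_a", "followers": 120},
]

# Maximum number of tweets by a single user
# (in Hive: GROUP BY user, ORDER BY COUNT(*) DESC LIMIT 1).
per_user = Counter(t["user"] for t in tweets)
top_user, top_count = per_user.most_common(1)[0]

# Average follower count over distinct users (in Hive: AVG over a deduplicated user table).
followers = {t["user"]: t["followers"] for t in tweets}
avg_followers = sum(followers.values()) / len(followers)

print(top_user, top_count)   # fan_a 3
print(avg_followers)         # 210.0
```

On the cluster itself, Hive compiles the equivalent SQL into MapReduce jobs, so the per-user grouping is distributed rather than held in one process as here.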
Abstract: Big data refers to the massive amounts and varieties of information, in structured and unstructured form, generated by social networking sites, biomedical equipment, financial companies, the internet and websites, scientific sensors, agricultural engineering sources, and so on. This huge amount of data cannot be processed using traditional data processing systems and technologies. Big data analytics is the process of extracting information and patterns from huge data sets; hence it needs a system architecture with mechanisms for data collection, transmission, storage, processing and analysis, and visualization. In this paper, we review the background and futuristic aspects of big data. We first introduce the history, background, and related technologies of big data. We then focus on big data system architecture and on the phases and classes of big data analytics, present an open-source big data framework that addresses some of the big data challenges, and finally discuss different applications of big data with examples.
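The phases of the analytics architecture listed above (collection, pre-processing, analysis, visualization) can be sketched as composed stages. All function names and sample data below are illustrative, not taken from the paper:

```python
# Illustrative pipeline stages mirroring the phases the abstract names;
# each stage is a plain function so the data flow stays explicit.

def collect():
    # Data collection: a fixed sample stands in for sensors or web sources.
    return ["7", "3", "bad", "10"]

def clean(raw):
    # Pre-processing: drop records that fail validation, parse the rest.
    return [int(r) for r in raw if r.isdigit()]

def analyze(values):
    # Analysis: a trivial aggregate stands in for real analytics.
    return {"count": len(values), "total": sum(values)}

def visualize(summary):
    # Visualization: a text rendering stands in for charts.
    return f"{summary['count']} records, total {summary['total']}"

report = visualize(analyze(clean(collect())))
print(report)  # 3 records, total 20
```

In a production architecture each stage maps to a distributed component (e.g. Flume or Kafka for collection, HDFS for storage, MapReduce or Spark for analysis), but the staged shape of the flow is the same.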
Abstract: As an important service model for advanced computing, SaaS uses a defined protocol to manage services and applications. The popularity of advanced computing has reached a level that leads to the generation of very large data sets, also called big data. Big data is evolving with great velocity, large volume, and great variety. This amplification of data has called into question the capabilities of existing database tools. Previously, storage and processing of data were simple tasks; now they are among the biggest challenges in the industry, and experts are paying close attention to big data. Designing a system capable of storing and analyzing such data in order to extract meaningful information for decision-making is a priority. Apache Hadoop, Spark, and NoSQL databases are some of the core technologies used to solve these issues. This paper contributes to solutions for big data storage and processing: it presents an analysis of the current technologies in the industry that could be useful in this context, and it focuses on implementing a novel Trinity model, built using the lambda architecture with the following technologies: Hadoop, Spark, Kafka, and MongoDB.
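The lambda architecture underlying the Trinity model combines a batch layer (precomputed views over the master dataset) with a speed layer (incremental views over recent events), merged by a serving layer at query time. A minimal sketch of that merge, with all metrics and data invented for illustration:

```python
# Minimal lambda-architecture sketch: a batch view computed over the master
# dataset plus a speed view over recent events, merged at query time.
# All data and metric names are invented for illustration.

batch_view = {"clicks": 1000, "signups": 40}      # precomputed (e.g. by Spark over HDFS)
recent_events = ["clicks", "clicks", "signups"]   # streamed (e.g. via Kafka)

def speed_view(events):
    # Incremental counts over events not yet absorbed by the batch layer.
    counts = {}
    for e in events:
        counts[e] = counts.get(e, 0) + 1
    return counts

def query(metric):
    # Serving layer: merge batch and speed views for an up-to-date answer.
    return batch_view.get(metric, 0) + speed_view(recent_events).get(metric, 0)

print(query("clicks"))   # 1002
print(query("signups"))  # 41
```

When the next batch run absorbs the streamed events, the speed view is reset; this periodic recomputation is what lets the batch layer correct any approximation in the speed layer.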