Abstract: The default scheduler of Apache Hadoop is operationally inefficient when connecting to external sources and processing transformation jobs. This paper proposes a novel scheduler, the Adaptive Node and Container Aware Scheduler (ANACRAC), which enhances the performance of the Hadoop Yet Another Resource Negotiator (YARN) scheduler by aligning cluster resources with the demands of real-world applications. The approach uses user-provided configurations to apportion nodes, or containers within nodes, to application thresholds. It also lets applications choose which nodes' resources they want to manage, and it adds limits to prevent threshold breaches by adding additional jobs as needed. Node awareness and container awareness can be used individually or in combination to increase efficiency. In addition, resource availability within nodes and containers can be investigated. The paper also addresses the elasticity of containers and their self-adaptiveness depending on the job type. The results showed that a 15%–20% performance improvement was achieved compared with the node and container awareness feature of ANACRAC. The ANACRAC scheduler was validated to deliver a 70%–90% performance improvement over the default Fair scheduler. Experimental results also demonstrated a performance improvement of 60% to 200% when applications were connected to external interfaces under high workloads.
Abstract: As the importance of email increases, the amount of malicious email is also increasing, so the need for malicious email filtering is growing. Because it is more economical to combine commodity hardware (a medium server or PC) with a virtual environment, use it as a single server resource, and filter malicious email with machine learning techniques, we used the Hadoop MapReduce framework and Naïve Bayes for malicious email filtering. Naïve Bayes was selected because it is one of the top machine learning methods (Support Vector Machine (SVM), Naïve Bayes, K-Nearest Neighbor (KNN), and Decision Tree) in terms of execution time and accuracy. Malicious email was filtered in two ways: with MapReduce programming using Naïve Bayes, a supervised machine learning method, in a Hadoop framework with optimized performance, and with a Python program using Naïve Bayes in a bare-metal server environment without Hadoop. A comparison of the accuracy and prediction error rates of the two methods shows that the Hadoop MapReduce Naïve Bayes method improved the accuracy of spam and ham email identification by a factor of 1.11 and the prediction error rate by a factor of 14.13 compared with the non-Hadoop Python Naïve Bayes method.
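The split of Naïve Bayes training into map and reduce phases described above can be sketched as follows. This is a minimal in-memory illustration, not the authors' implementation: it uses no Hadoop API, and the function names (`map_phase`, `reduce_phase`, `classify`), the `__doc__` marker key, and the four toy messages are all assumptions made for the example.

```python
import math
from collections import defaultdict

def map_phase(label, text):
    """Emit ((label, word), 1) pairs plus one document-count pair per message."""
    yield (label, "__doc__"), 1  # hypothetical marker key for counting documents
    for word in text.lower().split():
        yield (label, word), 1

def reduce_phase(pairs):
    """Sum the counts for each key, as a MapReduce reducer would."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

def classify(text, counts, labels=("spam", "ham")):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    vocab = {w for (_, w) in counts if w != "__doc__"}
    n_docs = sum(counts.get((l, "__doc__"), 0) for l in labels)
    best_label, best_score = None, -math.inf
    for label in labels:
        n_words = sum(c for (l, w), c in counts.items()
                      if l == label and w != "__doc__")
        score = math.log(counts[(label, "__doc__")] / n_docs)  # class prior
        for word in text.lower().split():
            likelihood = (counts.get((label, word), 0) + 1) / (n_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (illustrative only).
training = [
    ("spam", "win money now"),
    ("spam", "free money offer"),
    ("ham", "meeting agenda attached"),
    ("ham", "lunch meeting tomorrow"),
]
pairs = [kv for label, text in training for kv in map_phase(label, text)]
counts = reduce_phase(pairs)
print(classify("free money", counts))
print(classify("meeting tomorrow", counts))
```

Because the reducer only sums counts per key, the same logic would scale out across mappers and reducers without changing the classifier.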
Abstract: The Hadoop Distributed File System (HDFS) is one of the most widely used distributed file systems in big data analysis for frameworks such as Hadoop. HDFS allows one to manage large volumes of data using low-cost commodity hardware. However, vulnerabilities in HDFS can be exploited for nefarious activities. This reinforces the importance of robust security for file sharing in Hadoop, as well as a trusted mechanism for checking the authenticity of shared files. This is the focus of this paper, where we aim to improve the security of HDFS using a blockchain-enabled approach (hereafter referred to as BlockHDFS). Specifically, the proposed BlockHDFS uses the enterprise-level Hyperledger Fabric platform to capitalize on files' metadata for building trusted data security and traceability in HDFS.
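The idea of making file metadata traceable can be illustrated with a plain hash-chained log. This is only a stand-in for intuition: it does not use Hyperledger Fabric or HDFS, and the helper names (`entry_hash`, `append`, `verify`) and the metadata fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry

def entry_hash(prev_hash, metadata):
    """Hash the previous entry's hash together with canonical JSON metadata."""
    payload = prev_hash + json.dumps(metadata, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append(ledger, metadata):
    """Append a metadata record linked to the previous record."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({
        "prev": prev_hash,
        "metadata": metadata,
        "hash": entry_hash(prev_hash, metadata),
    })

def verify(ledger):
    """Recompute every link; any edited record breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        expected = entry_hash(prev_hash, entry["metadata"])
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

ledger = []
append(ledger, {"path": "/shared/report.csv", "size": 2048, "owner": "alice"})
append(ledger, {"path": "/shared/report.csv", "size": 4096, "owner": "alice"})
print(verify(ledger))

ledger[0]["metadata"]["size"] = 1  # simulate tampering with recorded metadata
print(verify(ledger))
```

A permissioned blockchain adds distributed consensus and access control on top of this linking, which a single-process list cannot provide.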
Funding: Supported by the National High Technology Research and Development Program of China (863 Program) (2015AA016006) and the National Natural Science Foundation of China (60903220).
Abstract: The large scale and distributed nature of cloud storage have become major challenges for file extraction in cloud forensics. Current disk forensic methods do not adapt well to cloud computing, and forensic research on distributed file systems is inadequate. To address these problems, this paper uses the Hadoop Distributed File System (HDFS) as a case study and proposes a forensic method for efficient file extraction based on three-level (3L) mapping. First, HDFS is analyzed from its overall architecture down to the local file system. Second, the 3L mapping of an HDFS file from the HDFS namespace to data blocks on the local file system is established, and a recovery method for deleted files based on 3L mapping is presented. Third, a multi-node Hadoop framework is set up on the Xen virtualization platform to test the performance of the method. The results indicate that the proposed method can efficiently locate large files stored across data nodes, make selective images of disk data, and achieve a high recovery rate for deleted files.
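A layered lookup from an HDFS path down to files on data nodes can be sketched with three dictionaries. The abstract does not spell out the exact levels, so the split used here (path to block IDs, block ID to data nodes, node and block to local file) is an assumption, and every path, block ID, and node name is hypothetical.

```python
# Level 1: HDFS namespace path -> ordered block IDs (hypothetical values).
namespace_to_blocks = {
    "/logs/app.log": ["blk_1001", "blk_1002"],
}

# Level 2: block ID -> data nodes that hold a replica.
block_to_nodes = {
    "blk_1001": ["dn1", "dn2"],
    "blk_1002": ["dn2", "dn3"],
}

# Level 3: (data node, block ID) -> file on that node's local file system.
local_paths = {
    ("dn1", "blk_1001"): "/data/dn1/blk_1001",
    ("dn2", "blk_1001"): "/data/dn2/blk_1001",
    ("dn2", "blk_1002"): "/data/dn2/blk_1002",
    ("dn3", "blk_1002"): "/data/dn3/blk_1002",
}

def locate(hdfs_path):
    """Resolve an HDFS path to (block, node, local file) triples."""
    result = []
    for block in namespace_to_blocks[hdfs_path]:
        for node in block_to_nodes[block]:
            result.append((block, node, local_paths[(node, block)]))
    return result

for block, node, path in locate("/logs/app.log"):
    print(block, node, path)
```

With such a mapping, an examiner could image only the listed local files rather than every disk in the cluster.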
Funding: Supported in part by the Key-Area Research and Development Program of Guangdong Province (No. 2020B010164002) and the Fundamental Research Foundation of Shenzhen Technology and Innovation Council (No. KCXFZ20201221173613035).
Abstract: A large volume of Remote Sensing (RS) data has been generated with the deployment of satellite technologies. The data facilitate research in ecological monitoring, land management, desertification, and other areas. The characteristics of RS data (e.g., enormous volume, large single-file size, and demanding fault-tolerance requirements) make the Hadoop Distributed File System (HDFS) an ideal choice for RS data storage, as it is efficient, scalable, and equipped with a data replication mechanism for failure resilience. Geospatial indexing is one of the most important techniques for using RS data. However, the large data volume makes geospatial indices time-consuming to construct and leverage. Considering that most modern geospatial data centres are equipped with HDFS-based big data processing infrastructures, deploying multiple geospatial indices is a natural way to optimise efficacy. Moreover, because of the reliability introduced by high-quality hardware and the infrequently modified nature of RS data, multi-indexing does not cause large overhead. We therefore design a framework called Multi-IndeXing-RS (MIX-RS) that unifies the multi-indexing mechanism on top of HDFS with data replication enabled, for both fault tolerance and geospatial indexing efficiency. Given the fault tolerance provided by HDFS, RS data are stored inside it in a structured form for faster geospatial indexing. Additionally, multi-indexing enhances efficiency. The proposed technique sits naturally on top of HDFS to form a holistic framework without incurring severe overhead or sophisticated system implementation effort. The MIX-RS framework is implemented and evaluated using real remote sensing data provided by the Chinese Academy of Sciences, demonstrating excellent geospatial indexing performance.
Funding: Supported in part by the National Key Basic Research and Development (973) Program of China (Nos. 2013CB228206 and 2012CB315801) and the National Natural Science Foundation of China (Nos. 61233016 and 61140320); also supported by the Intel Research Council under the title "Security Vulnerability Analysis Based on Cloud Platform with Intel IA Architecture".
Abstract: China Unicom, the largest WCDMA 3G operator in China, has to meet the requirements of the historic Mobile Internet Explosion, that is, the surge of Mobile Internet Traffic from mobile terminals. According to the internal statistics of China Unicom, mobile user traffic has increased rapidly, with a Compound Annual Growth Rate (CAGR) of 135%. China Unicom currently stores more than 2 trillion records per month, the data volume is over 525 TB, and the highest data volume has reached a peak of 5 PB. Since October 2009, China Unicom has been developing a home-grown big data storage and analysis platform based on the open source Hadoop Distributed File System (HDFS), as it has a long-term strategy to make full use of this Big Data. All Mobile Internet Traffic is well served by this big data platform. Currently, the writing speed has reached 1,390,000 records per second, and the record retrieval time in a table that contains trillions of records is less than 100 ms. To take advantage of this opportunity to become a Big Data Operator, China Unicom has developed new functions and multiple innovations to solve the space and time constraint challenges presented by data processing. In this paper, we introduce our big data platform in detail. Based on this platform, China Unicom is building an industry ecosystem around Mobile Internet Big Data, and considers that a telecom-operator-centric ecosystem can be formed that is critical to prosperity in the modern communications business.
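To put the quoted figures in perspective, the short calculation below compounds a 135% CAGR and converts the stated write speed into a daily total. The three-year horizon and the assumption that the peak write speed is sustained for a full day are illustrative choices, not claims from the abstract.

```python
def project(initial, cagr, years):
    """Value after compounding `initial` at annual rate `cagr` for `years` years."""
    return initial * (1 + cagr) ** years

def records_per_day(records_per_second):
    """Records written in 24 hours at a constant per-second rate."""
    return records_per_second * 24 * 60 * 60

# A CAGR of 135% means traffic multiplies by 2.35 each year.
growth_3y = project(1.0, 1.35, 3)  # assumed 3-year horizon
daily = records_per_day(1_390_000)  # assumes the quoted rate is sustained

print(f"{growth_3y:.2f}x after 3 years")
print(f"{daily:,} records/day")
```

At that sustained rate the platform would ingest roughly 120 billion records per day, the same order of magnitude as the stated 2 trillion records per month.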