Abstract: The Hadoop Distributed File System (HDFS) is one of the widely used distributed file systems in big data analysis for frameworks such as Hadoop. HDFS allows one to manage large volumes of data using low-cost commodity hardware. However, vulnerabilities in HDFS can be exploited for nefarious activities. This reinforces the importance of ensuring robust security to facilitate file sharing in Hadoop, as well as having a trusted mechanism to check the authenticity of shared files. This is the focus of this paper, where we aim to improve the security of HDFS using a blockchain-enabled approach (hereafter referred to as BlockHDFS). Specifically, the proposed BlockHDFS uses the enterprise-level Hyperledger Fabric platform to capitalize on files' metadata for building trusted data security and traceability in HDFS.
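The idea of anchoring file metadata in a ledger so that shared files can later be checked for authenticity can be illustrated with a small sketch. This is not the BlockHDFS implementation and does not use the Hyperledger Fabric API; `ToyLedger`, `metadata_digest`, and the sample metadata fields are hypothetical stand-ins, and an in-memory hash chain replaces the real blockchain.

```python
import hashlib
import json


def metadata_digest(meta):
    """SHA-256 over a canonical JSON encoding of the file metadata."""
    canonical = json.dumps(meta, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ToyLedger:
    """Hypothetical append-only hash chain standing in for a blockchain ledger."""

    def __init__(self):
        self.entries = []

    def record(self, path, meta):
        # Each entry commits to the previous entry, so history is tamper-evident.
        prev = self.entries[-1]["entry_hash"] if self.entries else "0" * 64
        digest = metadata_digest(meta)
        entry_hash = hashlib.sha256((prev + path + digest).encode("utf-8")).hexdigest()
        self.entries.append(
            {"path": path, "digest": digest, "prev": prev, "entry_hash": entry_hash}
        )

    def is_authentic(self, path, meta):
        # Compare the presented metadata with the most recent record for the path.
        for entry in reversed(self.entries):
            if entry["path"] == path:
                return entry["digest"] == metadata_digest(meta)
        return False


ledger = ToyLedger()
original = {"owner": "alice", "size": 1024, "mtime": 1700000000}  # made-up metadata
ledger.record("/data/report.csv", original)
tampered = dict(original, size=2048)
```

A reader of the shared file would recompute the digest from the metadata they observe and compare it with the ledger entry; any change to a recorded field changes the digest.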
Funding: Supported in part by the National Basic Research Program of China ("973" Program) (No. 2013CB329102).
Abstract: An adaptive dynamic load balancing algorithm based on QoS is proposed to improve the performance of load balancing in distributed file systems, combining the advantages of a variety of load balancing algorithms. The new algorithm uses a tuple containing the number of files and the total file size as the QoS measure for the requested task. The master node sets a threshold for the requested task based on the QoS to filter storage nodes that meet the requirements of the task. To guarantee the reliability of the new algorithm, we consider the impact of CPU utilization, memory usage, disk I/O occupancy rate, network bandwidth usage, and hard disk usage on load balancing performance when calculating the real-time load of storage nodes. The heterogeneity of the network is considered when the master node schedules task assignments, to ensure the fairness of the algorithm. The comprehensive evaluation value is determined based on the performance-load ratio, which is calculated from the real-time load value of the storage node and a performance value after normalization. The master node assigns tasks to the storage node with the highest comprehensive evaluation value. The storage nodes provide adaptive feedback based on changes in the degree of connectivity, rather than periodic updates of the load information. An actual distributed file system environment is set up on a server cluster, and the performance of the new algorithm is tested through a comparative experiment. The experimental results show that the new algorithm can effectively reduce the average response time of the system, improve throughput, and enable the system load to reach a good balance.
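The selection step described in this abstract (filter nodes by a QoS threshold, then pick the node with the highest performance-load ratio) can be sketched as follows. The abstract gives no formulas, so the weights in `WEIGHTS`, the linear load model, the threshold test, and every field name are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class StorageNode:
    name: str
    performance: float    # normalized capability in (0, 1]; assumed to be given
    cpu: float            # the five load metrics are fractions in [0, 1]
    memory: float
    disk_io: float
    network: float
    disk_usage: float
    free_bytes: int       # remaining capacity, used by the QoS filter
    free_file_slots: int  # how many more files the node may accept (hypothetical)


# Illustrative weights only; the abstract does not state them.
WEIGHTS = {"cpu": 0.3, "memory": 0.2, "disk_io": 0.2, "network": 0.2, "disk_usage": 0.1}


def real_time_load(node):
    """Weighted sum of the five load metrics named in the abstract."""
    return sum(weight * getattr(node, metric) for metric, weight in WEIGHTS.items())


def evaluation_value(node, eps=1e-6):
    """Performance-load ratio; eps avoids division by zero on an idle node."""
    return node.performance / (real_time_load(node) + eps)


def pick_node(nodes, num_files, total_size):
    """Filter by the QoS tuple (num_files, total_size), then take the best ratio."""
    candidates = [
        n for n in nodes
        if n.free_file_slots >= num_files and n.free_bytes >= total_size
    ]
    if not candidates:
        return None
    return max(candidates, key=evaluation_value)


nodes = [
    StorageNode("a", 1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 10**9, 1000),
    StorageNode("b", 0.8, 0.2, 0.2, 0.2, 0.2, 0.2, 10**9, 1000),
    StorageNode("c", 1.0, 0.1, 0.1, 0.1, 0.1, 0.1, 100, 1000),  # too little space
]
chosen = pick_node(nodes, num_files=10, total_size=1000)
```

In this sample, node `c` is the least loaded but fails the capacity filter, and node `b` beats the more powerful but heavily loaded node `a`.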
Funding: This work was supported by the National Key Research & Development Plan of China under Grant 2016QY05X1000, the National Natural Science Foundation of China under Grant No. 61771166, and the CERNET Innovation Project (NGII20170412).
Abstract: The rapid development of Internet of Things (IoT) technology has made previously unavailable data available, and applications can take advantage of device data for people to visualize, explore, and build complex analyses. As the size of the network and the number of network users continue to increase, network requests tend to aggregate on a small number of network resources, which results in an uneven load of network requests. Real-time, highly reliable network file distribution technology is of great importance in the Internet of Things. This paper studies real-time and highly reliable file distribution technology for large-scale networks. To this end, it reviews current file distribution technology, proposes a file distribution model, and proposes a corresponding load balancing method based on that model. Experiments show that the system achieves real-time and highly reliable network transmission.
Abstract: The default scheduler of Apache Hadoop demonstrates operational inefficiencies when connecting external sources and processing transformation jobs. This paper proposes a novel scheduler that enhances the performance of the Hadoop Yet Another Resource Negotiator (YARN) scheduler, called the Adaptive Node and Container Aware Scheduler (ANACRAC), which aligns cluster resources to the demands of real-world applications. The approach leverages user-provided configurations as a unique design to apportion nodes, or containers within the nodes, to application thresholds. Additionally, it gives applications the flexibility to select which node’s resources they want to manage and adds limits to prevent threshold breaches by adding additional jobs as needed. Node or container awareness can be utilized individually or in combination to increase efficiency. On top of this, the resource availability within the node and containers can also be investigated. This paper also focuses on the elasticity of the containers and self-adaptiveness depending on the job type. The results showed that a 15%–20% performance improvement was achieved compared with the node and container awareness feature of ANACRAC. It has been validated that the ANACRAC scheduler demonstrates a 70%–90% performance improvement compared with the default Fair scheduler. Experimental results also demonstrated the success of the enhancement and a performance improvement in the range of 60% to 200% when applications were connected with external interfaces and high workloads.
Abstract: As the importance of email increases, the amount of malicious email is also increasing, so the need for malicious email filtering is growing. Since it is more economical to combine commodity hardware consisting of a medium server or PC with a virtual environment to use as a single server resource and filter malicious email using machine learning techniques, we used a Hadoop MapReduce framework and Naïve Bayes among machine learning methods for malicious email filtering. Naïve Bayes was selected because it is one of the top machine learning methods (Support Vector Machine (SVM), Naïve Bayes, K-Nearest Neighbor (KNN), and Decision Tree) in terms of execution time and accuracy. Malicious email was filtered in two ways: with MapReduce programming using the Naïve Bayes technique, which is a supervised machine learning method, in a Hadoop framework with optimized performance; and with a Python program applying the Naïve Bayes technique in a bare-metal server environment without Hadoop. According to a comparison of the accuracy and prediction error rates of the two methods, the Hadoop MapReduce Naïve Bayes method improved the accuracy of spam and ham email identification 1.11 times and the prediction error rate 14.13 times compared to the non-Hadoop Python Naïve Bayes method.
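How Naïve Bayes training decomposes into map and reduce steps can be shown with a single-process sketch. It only mimics the MapReduce pattern in plain Python and does not use Hadoop; the whitespace tokenizer, the add-one (Laplace) smoothing, the `__doc__` marker key, and the four sample emails are assumptions for illustration, not details taken from the paper.

```python
import math
from collections import Counter

DOC_MARKER = "__doc__"  # reserved key for per-label document counts (assumption)


def map_phase(label, text):
    """Mapper: emit ((label, token), 1) pairs plus one document-count pair."""
    yield (label, DOC_MARKER), 1
    for word in text.lower().split():
        yield (label, word), 1


def reduce_phase(pairs):
    """Reducer: sum the values emitted for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def train(docs):
    pairs = (kv for label, text in docs for kv in map_phase(label, text))
    return reduce_phase(pairs)


def classify(totals, text, labels=("spam", "ham")):
    """Pick the label with the highest log-posterior, using add-one smoothing."""
    vocab = {word for (_, word) in totals if word != DOC_MARKER}
    n_docs = sum(totals[(label, DOC_MARKER)] for label in labels)
    best_label, best_score = None, -math.inf
    for label in labels:
        n_words = sum(
            count for (lab, word), count in totals.items()
            if lab == label and word != DOC_MARKER
        )
        score = math.log(totals[(label, DOC_MARKER)] / n_docs)
        for word in text.lower().split():
            score += math.log((totals[(label, word)] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


docs = [
    ("spam", "win free money now"),
    ("spam", "free prize win"),
    ("ham", "meeting at noon"),
    ("ham", "project meeting notes"),
]
model = train(docs)
```

Because the reducer only sums counts per key, the same mapper output could be partitioned across many workers, which is what makes the counting stage a natural fit for MapReduce.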
Funding: Supported in part by the Key-Area Research and Development Program of Guangdong Province (No. 2020B010164002) and the Fundamental Research Foundation of Shenzhen Technology and Innovation Council (No. KCXFZ20201221173613035).
Abstract: A large volume of Remote Sensing (RS) data has been generated with the deployment of satellite technologies. The data facilitate research in ecological monitoring, land management, desertification, etc. The characteristics of RS data (e.g., enormous volume, large single-file size, and demanding fault tolerance requirements) make the Hadoop Distributed File System (HDFS) an ideal choice for RS data storage, as it is efficient, scalable, and equipped with a data replication mechanism for failure resilience. To use RS data, one of the most important techniques is geospatial indexing. However, the large data volume makes geospatial indices time-consuming to construct and leverage. Considering that most modern geospatial data centres are equipped with HDFS-based big data processing infrastructures, deploying multiple geospatial indices becomes a natural way to optimise efficacy. Moreover, because of the reliability introduced by high-quality hardware and the infrequently modified nature of RS data, the use of multi-indexing will not cause large overhead. Therefore, we design a framework called Multi-IndeXing-RS (MIX-RS) that unifies the multi-indexing mechanism on top of the HDFS with data replication enabled for both fault tolerance and geospatial indexing efficiency. Given the fault tolerance provided by the HDFS, RS data are structurally stored inside for faster geospatial indexing. Additionally, multi-indexing enhances efficiency. The proposed technique naturally sits on top of the HDFS to form a holistic framework without incurring severe overhead or sophisticated system implementation efforts. The MIX-RS framework is implemented and evaluated using real remote sensing data provided by the Chinese Academy of Sciences, demonstrating excellent geospatial indexing performance.
Funding: Supported by the National High Technology Research and Development Program of China (863 Program) (2015AA016006) and the National Natural Science Foundation of China (60903220).
Abstract: The large scale and distribution of cloud computing storage have become major challenges for file extraction in cloud forensics. Current disk forensic methods do not adapt well to cloud computing, and forensic research on distributed file systems is inadequate. To address these forensic problems, this paper uses the Hadoop Distributed File System (HDFS) as a case study and proposes a forensic method for efficient file extraction based on three-level (3L) mapping. First, HDFS is analyzed from its overall architecture down to the local file system. Second, the 3L mapping of an HDFS file from the HDFS namespace to data blocks on the local file system is established, and a recovery method for deleted files based on 3L mapping is presented. Third, a multi-node Hadoop framework on the Xen virtualization platform is set up to test the performance of the method. The results indicate that the proposed method can efficiently locate large files stored across data nodes, create selective images of disk data, and achieve a high recovery rate for deleted files.
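A mapping chain from an HDFS path down to block files on data nodes can be pictured with three lookup tables. The abstract does not define the three levels, so the split used here (path to block IDs, block ID to data nodes, data node plus block ID to local file) is one plausible reading, and all names, block IDs, and paths are made-up sample data.

```python
# Level 1: HDFS namespace path -> ordered block IDs (hypothetical sample data).
namespace = {"/logs/app.log": ["blk_1001", "blk_1002"]}

# Level 2: block ID -> data nodes that hold a replica.
block_locations = {
    "blk_1001": ["dn1", "dn2"],
    "blk_1002": ["dn2", "dn3"],
}

# Level 3: (data node, block ID) -> block file on that node's local file system.
local_files = {
    ("dn1", "blk_1001"): "/data/dn1/blk_1001",
    ("dn2", "blk_1001"): "/data/dn2/blk_1001",
    ("dn2", "blk_1002"): "/data/dn2/blk_1002",
    ("dn3", "blk_1002"): "/data/dn3/blk_1002",
}


def locate(path):
    """Resolve an HDFS path to [(block_id, [(data_node, local_file), ...]), ...]."""
    result = []
    for block_id in namespace[path]:
        replicas = [
            (node, local_files[(node, block_id)])
            for node in block_locations.get(block_id, [])
            if (node, block_id) in local_files
        ]
        result.append((block_id, replicas))
    return result


located = locate("/logs/app.log")
```

With such a chain, an examiner only needs to image the listed block files instead of whole disks, which matches the selective imaging goal stated in the abstract.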
Funding: Supported by the National Key Basic Research and Development (973) Program of China (Nos. 2012CB315801 and 2011CB302805), the National Natural Science Foundation of China A3 Program (No. 61161140320), the National Natural Science Foundation of China (No. 61233016), and the Intel Research Council UPO program under the title "Security Vulnerability Analysis Based on Cloud Platform with Intel IA Architecture".
Abstract: The archiving of Internet traffic is an essential function for retrospective network event analysis and forensic computer communication. The state-of-the-art approach for network monitoring and analysis involves storage and analysis of network flow statistics. However, this approach loses much valuable information within the Internet traffic. With the advancement of commodity hardware, in particular the volume of storage devices and the speed of interconnect technologies used in network adapter cards and multi-core processors, it is now possible to capture real-time network traffic at 10 Gbps and beyond using a commodity computer (e.g., with n2disk). Also, with the advancement of distributed file systems (such as Hadoop, ZFS, etc.) and open cloud computing platforms (such as OpenStack, CloudStack, and Eucalyptus), it is practical to store such a large volume of traffic data and analyse the communication in depth within an acceptable latency. In this paper, based on the well-known TimeMachine, we present TIFAflow, the design and implementation of a novel system for archiving and querying network flows. Firstly, we enhance the traffic archiving system named TImemachine+FAstbit (TIFA) with flow granularity, i.e., we supply the system with a flow table and a flow module. Secondly, based on real network traces, we conduct performance comparison experiments of TIFAflow against other implementations such as a common database solution, TimeMachine, and the TIFA system. Finally, based on the comparison results, we demonstrate that TIFAflow achieves better storing and querying performance than TimeMachine and TIFA, in both time and space metrics.
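What flow granularity means in practice can be sketched by aggregating packets into a flow table. The abstract says only that a flow table and flow module are added; keying flows by the usual 5-tuple, the record fields, and the sample packets below are assumptions for illustration and are not taken from TIFAflow.

```python
from collections import namedtuple

# Conventional 5-tuple flow key (an assumption; the abstract does not define it).
FlowKey = namedtuple("FlowKey", "src_ip dst_ip src_port dst_port proto")


def build_flow_table(packets):
    """Aggregate (timestamp, FlowKey, length) packets into per-flow records."""
    table = {}
    for timestamp, key, length in packets:
        record = table.get(key)
        if record is None:
            table[key] = {
                "first": timestamp,
                "last": timestamp,
                "packets": 1,
                "bytes": length,
            }
        else:
            record["first"] = min(record["first"], timestamp)
            record["last"] = max(record["last"], timestamp)
            record["packets"] += 1
            record["bytes"] += length
    return table


web = FlowKey("10.0.0.1", "10.0.0.2", 40000, 80, "tcp")
dns = FlowKey("10.0.0.1", "10.0.0.53", 53000, 53, "udp")
packets = [
    (1.0, web, 60),
    (1.2, dns, 80),
    (1.5, web, 1500),
    (2.0, web, 1500),
]
flows = build_flow_table(packets)
```

A query such as "all traffic of one connection in a time window" can then start from the flow record's `first` and `last` timestamps instead of scanning every archived packet.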
Funding: Supported in part by the National Key Basic Research and Development (973) Program of China (Nos. 2013CB228206 and 2012CB315801), the National Natural Science Foundation of China (Nos. 61233016 and 61140320), and the Intel Research Council under the title "Security Vulnerability Analysis Based on Cloud Platform with Intel IA Architecture".
Abstract: China Unicom, the largest WCDMA 3G operator in China, is facing the historic Mobile Internet Explosion, that is, the surge of Mobile Internet Traffic from mobile terminals. According to the internal statistics of China Unicom, mobile user traffic has increased rapidly with a Compound Annual Growth Rate (CAGR) of 135%. Currently, China Unicom stores more than 2 trillion records per month, with a data volume of over 525 TB, and the highest data volume has reached a peak of 5 PB. Since October 2009, China Unicom has been developing a home-brewed big data storage and analysis platform based on the open source Hadoop Distributed File System (HDFS), as it has a long-term strategy to make full use of this Big Data. All Mobile Internet Traffic is well served using this big data platform. Currently, the writing speed has reached 1,390,000 records per second, and the record retrieval time in a table that contains trillions of records is less than 100 ms. To take advantage of this opportunity to be a Big Data Operator, China Unicom has developed new functions and multiple innovations to solve the space and time constraint challenges presented in data processing. In this paper, we introduce our big data platform in detail. Based on this big data platform, China Unicom is building an industry ecosystem around Mobile Internet Big Data, and considers that a telecom-operator-centric ecosystem can be formed that is critical to reaching prosperity in the modern communications business.
Funding: Supported by the National Key Basic Research and Development (973) Program of China (Nos. 2012CB315801 and 2011CB302805), the National Natural Science Foundation of China (Nos. 61161140320 and 61233016), and the Intel Research Council under the title "Security Vulnerability Analysis Based on Cloud Platform with Intel IA Architecture".
Abstract: With the explosive increase in mobile apps, more and more threats are migrating from the traditional PC client to mobile devices. Whereas the Win+Intel alliance dominated the PC, the Android+ARM alliance dominates the Mobile Internet, and apps have replaced PC client software as the major target of malicious usage. In this paper, to improve the security status of current mobile apps, we propose a methodology to evaluate mobile apps based on a cloud computing platform and data mining. We also present a prototype system named MobSafe to identify a mobile app's virulence or benignancy. Compared with traditional methods, such as permission-pattern-based methods, MobSafe combines dynamic and static analysis methods to comprehensively evaluate an Android app. In the implementation, we adopt the Android Security Evaluation Framework (ASEF) and the Static Android Analysis Framework (SAAF), two representative dynamic and static analysis methods, to evaluate Android apps and estimate the total time needed to evaluate all the apps stored in one mobile app market. Based on a real trace from a commercial mobile app market called AppChina, we collect statistics on the number of active Android apps, the average number of apps installed on one Android device, and the expanding ratio of mobile apps. As the mobile app market serves as the main line of defence against mobile malware, our evaluation results show that it is practical to use a cloud computing platform and data mining to verify all stored apps routinely and filter out malware apps from mobile app markets. As future work, MobSafe can extensively use machine learning to conduct automated forensic analysis of mobile apps based on the multifaceted data generated in this stage.