Abstract: Big data analytics is a popular research topic because of its applicability to a variety of real-time applications. Recent machine learning and deep learning models can be applied to analyze big data with better performance. Since big data involves numerous features and demands high computational time, feature selection methods based on metaheuristic optimization algorithms can be adopted to choose an optimal set of features and thereby improve overall classification performance. This study proposes a new sigmoid butterfly optimization method with an optimum gated recurrent unit (SBOA-OGRU) model for big data classification in Apache Spark. The SBOA-OGRU technique uses an SBOA-based feature selection technique to choose an optimal subset of features. In addition, an OGRU-based classification model is employed to classify the big data into appropriate classes, and the hyperparameters of the GRU model are tuned using the Adam optimizer. Furthermore, the Apache Spark platform is applied to process big data effectively. To validate the SBOA-OGRU technique, a wide range of experiments were performed, and the experimental results highlighted its superiority.
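The abstract does not give the SBOA update rule, so the following is only a minimal sketch of the generic idea behind sigmoid-based binary feature selection: a sigmoid transfer function squashes each continuous search-agent coordinate into a probability, and a random draw against that probability decides whether the feature is kept. The names `sigmoid`, `binarize_position`, and `select_features`, along with the example position vector, are hypothetical and not taken from the paper.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binarize_position(position, rng):
    """Turn a continuous position vector into a 0/1 feature mask.

    Each coordinate is squashed by the sigmoid into a selection probability;
    the feature is selected when a uniform random draw falls below it.
    (Assumed generic scheme, not the paper's exact rule.)
    """
    return [1 if rng.random() < sigmoid(x) else 0 for x in position]


def select_features(row, mask):
    """Keep only the columns of `row` whose mask bit is 1."""
    return [value for value, keep in zip(row, mask) if keep]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
position = [2.5, -3.0, 0.1, 4.0, -1.2]  # hypothetical butterfly position
mask = binarize_position(position, rng)
print(mask, select_features([10, 20, 30, 40, 50], mask))
```

In a full pipeline, each candidate mask would be scored by training a classifier on the selected columns; that fitness evaluation is omitted here.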
Funding: This study was financially supported by the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), the Ministry of Health and Welfare (HI18C1216), and the Soonchunhyang University Research Fund.
Abstract: Big data applications in healthcare have provided a variety of solutions to reduce costs, errors, and waste. This work aims to develop a real-time system based on big medical data processing in the cloud for the prediction of health issues. In the proposed scalable system, medical parameters are sent to Apache Spark, which extracts attributes from the data and applies the proposed machine learning algorithm. In this way, healthcare risks can be predicted and sent as alerts and recommendations to users and healthcare providers. The proposed work also aims to provide an effective recommendation system that uses streaming medical data, historical data from a user's profile, and a knowledge database to make the most appropriate real-time recommendations and alerts based on the sensor's measurements. The proposed scalable system works by tweeting the health status attributes of users. Each user's cloud profile receives the streaming healthcare data in real time, and a machine learning prediction algorithm extracts the health attributes to predict the user's health status. Subsequently, this status can be sent on demand to healthcare providers. Machine learning algorithms can therefore be applied to streaming healthcare data from wearables to give users insight into their health status. These algorithms can help healthcare providers and individuals focus on health risks and changes in health status and consequently improve quality of life.
Abstract: This article examines the relationship between big data, cloud computing, and artificial intelligence, outlining their fundamental attributes and interdependence. It explores the integration of AI methodologies into cloud computing and big data analytics, including the development of a cloud computing framework built on the Hadoop platform and enriched by AI learning algorithms. Additionally, it examines the creation of a predictive model based on tailored artificial intelligence techniques. Simulations are conducted within the Hadoop environment to evaluate the method and assess its performance, thereby reaffirming the precision of the proposed approach. The results and analysis section presents findings from these simulations, which demonstrate the efficacy of the Sport AI Model (SAIM) framework in enhancing the accuracy of sports-related outcome predictions. Through mathematical analyses and performance assessments, integrating AI with big data emerges as a powerful tool for optimizing decision-making in sports. The discussion section extends the implications of these results, highlighting the potential for SAIM to improve sports forecasting, strategic planning, and performance optimization for players and coaches. The combination of big data, cloud computing, and AI offers a promising avenue for future advances in sports analytics. This research underscores the synergy between these technologies and paves the way for innovative approaches to sports-related decision-making and performance enhancement.
Funding: This research is funded by NASA (National Aeronautics and Space Administration) NCCS and AIST (NNX15AM85G) and the NSF I/UCRC, CSSI, and EarthCube Programs (1338925 and 1835507).
Abstract: Earth observations and model simulations are generating big multidimensional array-based raster data. However, it is difficult to query these big raster data efficiently because of inconsistencies among the geospatial raster data model, the distributed physical data storage model, and the data pipeline in distributed computing frameworks. To process big geospatial data efficiently, this paper proposes a three-layer hierarchical indexing strategy to optimize Apache Spark with the Hadoop Distributed File System (HDFS) in the following respects: (1) improve I/O efficiency by adopting a chunking data structure; (2) maintain workload balance and high data locality by building a global index (k-d tree); (3) enable Spark and HDFS to natively support geospatial raster data formats (e.g., HDF4, NetCDF4, GeoTiff) by building a local index (hash table); (4) index the in-memory data to further improve geospatial data queries; (5) develop a data repartition strategy to tune query parallelism while keeping high data locality. These strategies are implemented by developing customized RDDs and evaluated by comparing their performance with that of Spark SQL and SciSpark. The proposed indexing strategy can be applied to other distributed frameworks or cloud-based computing systems to natively support highly efficient queries on big geospatial data.
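As a rough illustration of items (1) and (3), the sketch below splits a 2-D raster into fixed-size chunks and keeps a hash table from chunk coordinates to a storage location, so that a bounding-box query touches only the overlapping chunks. The chunk size, the `ChunkIndex` class, the file name, and the (file, byte offset) layout are all assumptions made for illustration; the abstract does not describe the paper's actual index structure.

```python
class ChunkIndex:
    """Hash-table index from chunk coordinates to where the chunk is stored."""

    def __init__(self, chunk_rows, chunk_cols):
        self.chunk_rows = chunk_rows
        self.chunk_cols = chunk_cols
        self.table = {}  # (chunk_row, chunk_col) -> (file_name, byte_offset)

    def add(self, chunk_row, chunk_col, file_name, byte_offset):
        self.table[(chunk_row, chunk_col)] = (file_name, byte_offset)

    def query(self, row_min, row_max, col_min, col_max):
        """Return locations of chunks overlapping an inclusive cell range.

        Cell coordinates are assumed to be non-negative integers.
        """
        hits = []
        for cr in range(row_min // self.chunk_rows, row_max // self.chunk_rows + 1):
            for cc in range(col_min // self.chunk_cols, col_max // self.chunk_cols + 1):
                if (cr, cc) in self.table:
                    hits.append(self.table[(cr, cc)])
        return hits


# Index a hypothetical 400 x 400 raster stored as 100 x 100 chunks.
index = ChunkIndex(chunk_rows=100, chunk_cols=100)
for cr in range(4):
    for cc in range(4):
        # Assumed layout: one 4-byte value per cell, chunks stored back to back.
        index.add(cr, cc, "raster.nc", (cr * 4 + cc) * 100 * 100 * 4)

print(index.query(row_min=50, row_max=149, col_min=0, col_max=99))
```

A query for rows 50 to 149 and columns 0 to 99 reads two of the sixteen chunks, which is the I/O saving that chunking is meant to provide.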
Funding: This work was supported by the National Natural Science Foundation of China (No. 61272447), Sichuan Province Science and Technology Planning (Nos. 2016GZ0042, 16ZHSF0483, and 2017GZ0168), the Key Research Project of Sichuan Provincial Department of Education (Nos. 17ZA0238 and 17ZA0200), and the Scientific Research Starting Foundation for Young Teachers of Sichuan University (No. 2015SCU11079).
Abstract: Extracting and analyzing network traffic features is fundamental to the design and implementation of network behavior anomaly detection methods. The traditional network traffic feature method focuses on the statistical features of traffic volume. However, this approach is not sufficient to reflect communication pattern features. A different approach is required to detect anomalous behaviors that do not exhibit traffic volume changes, such as low-intensity anomalous behaviors caused by Denial of Service/Distributed Denial of Service (DoS/DDoS) attacks, Internet worms and scanning, and botnets. We propose an efficient traffic feature extraction architecture that combines the benefits of traffic volume features and network communication pattern features. This method can detect both low-intensity anomalous network behaviors and conventional traffic volume anomalies. We implemented our approach on Spark Streaming and validated our feature set using a labelled real-world dataset collected from the Sichuan University campus network. Our results demonstrate that the traffic feature extraction approach is efficient in detecting both traffic variations and communication structure changes. In our evaluation on the MIT-DARPA dataset, the same detection approach achieves a detection precision of 82.3% with traffic volume features and 89.9% with communication pattern features. Our proposed feature set improves precision by 94%.
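The abstract does not list the features in the proposed set, so the following is only a sketch of the distinction it draws between traffic volume features and communication pattern features. For one time window it computes two volume features (packet and byte counts) and two pattern features (number of distinct destinations and the Shannon entropy of the destination distribution). The function name `window_features`, the tuple layout, and the choice of entropy as a pattern feature are assumptions made for illustration.

```python
import math
from collections import Counter


def window_features(packets):
    """Compute features for one time window.

    `packets` is a list of (src_ip, dst_ip, n_bytes) tuples.
    """
    # Traffic volume features.
    total_packets = len(packets)
    total_bytes = sum(n for _, _, n in packets)

    # Communication pattern features: how many distinct peers, and how evenly
    # traffic is spread over destinations (Shannon entropy, in bits).
    dst_counts = Counter(dst for _, dst, _ in packets)
    entropy = 0.0
    for count in dst_counts.values():
        p = count / total_packets
        entropy -= p * math.log2(p)

    return {
        "packets": total_packets,
        "bytes": total_bytes,
        "distinct_dst": len(dst_counts),
        "dst_entropy": entropy,
    }


# Two hypothetical windows with identical volume but different structure.
normal = [("10.0.0.1", "10.0.0.9", 60)] * 4
scan = [("10.0.0.1", f"10.0.0.{i}", 60) for i in range(2, 6)]
print(window_features(normal))
print(window_features(scan))
```

Both windows carry four packets and 240 bytes, so volume features alone cannot tell them apart, while the destination count and entropy differ. This is the kind of low-intensity behavior, such as scanning, that the abstract says volume statistics miss.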
Funding: This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2016R1D1A1B03931529).
Abstract: Purpose: We propose InParTen2, a multi-aspect parallel factor analysis three-dimensional tensor decomposition algorithm based on the Apache Spark framework. The proposed method reduces re-decomposition cost and can handle large tensors. Design/methodology/approach: Considering that tensor addition increases the size of a given tensor along all axes, the proposed method decomposes incoming tensors using existing decomposition results without generating sub-tensors. Additionally, InParTen2 avoids the calculation of Khatri–Rao products and minimizes shuffling by using the Apache Spark platform. Findings: The performance of InParTen2 is evaluated by comparing its execution time and accuracy with those of existing distributed tensor decomposition methods on various datasets. The results confirm that InParTen2 can process large tensors and reduce the re-calculation cost of tensor decomposition. Consequently, the proposed method is faster than existing tensor decomposition algorithms and can significantly reduce re-decomposition cost. Research limitations: There are several Hadoop-based distributed tensor decomposition algorithms as well as MATLAB-based decomposition methods. However, the former require longer iteration times, so their execution time cannot be compared with that of Spark-based algorithms, whereas the latter run on a single machine, which limits their ability to handle large data. Practical implications: The proposed algorithm can reduce re-decomposition cost when tensors are added to a given tensor by decomposing them based on existing decomposition results without re-decomposing the entire tensor. Originality/value: The proposed method can handle large tensors and is fast within the limited-memory framework of Apache Spark. Moreover, InParTen2 can handle static as well as incremental tensor decomposition.
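To show why avoiding Khatri–Rao products matters for large tensors, the sketch below computes the product directly for two small factor matrices. The Khatri–Rao product is the column-wise Kronecker product: for an I x R matrix and a J x R matrix it yields an (I*J) x R matrix, so the intermediate result grows with the product of the two dimensions. This is a plain-Python definition for illustration only; the function name `khatri_rao` and the example matrices are assumptions, and the code does not reproduce how InParTen2 avoids the computation.

```python
def khatri_rao(a, b):
    """Column-wise Kronecker product of matrices given as lists of rows.

    `a` is I x R and `b` is J x R; the result is (I*J) x R, where row
    i*J + j holds a[i][r] * b[j][r] for each column r.
    """
    rank = len(a[0])
    if any(len(row) != rank for row in a + b):
        raise ValueError("both matrices must have the same number of columns")
    return [
        [a_row[r] * b_row[r] for r in range(rank)]
        for a_row in a
        for b_row in b
    ]


a = [[1, 2], [3, 4]]          # 2 x 2 factor matrix
b = [[1, 0], [0, 1], [2, 2]]  # 3 x 2 factor matrix
product = khatri_rao(a, b)    # 6 x 2 result
print(len(product), "rows x", len(product[0]), "columns")
```

With factor matrices of a million rows each, the explicit product would have 10^12 rows, which is why distributed decomposition methods try not to materialize it.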