Recently, anomaly detection (AD) in streaming data has gained significant attention among research communities due to its applicability in finance, business, healthcare, education, etc. Recent deep learning (DL) models have proven helpful in the detection and classification of anomalies. This article designs an oversampling with optimal deep learning-based streaming data classification (OS-ODLSDC) model. The aim of the OS-ODLSDC model is to recognize and classify anomalies in streaming data. The proposed OS-ODLSDC model first performs a preprocessing step. Since streaming data is unbalanced, the support vector machine (SVM)-based Synthetic Minority Over-sampling Technique (SVM-SMOTE) is applied for oversampling. In addition, the OS-ODLSDC model employs bidirectional long short-term memory (BiLSTM) for AD and classification. Finally, the root mean square propagation (RMSProp) optimizer is applied for hyperparameter tuning of the BiLSTM model. To verify the performance of the OS-ODLSDC model, a wide-ranging experimental analysis is performed using three benchmark datasets: CICIDS 2018, KDD-Cup 1999, and NSL-KDD.
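The oversampling step in the abstract above can be illustrated with a minimal sketch of the interpolation idea behind SMOTE. This is plain SMOTE, not SVM-SMOTE (which additionally uses SVM support vectors to choose which minority samples to interpolate from), and it is not the authors' implementation. The function name, the toy data, and the parameters k and seed are assumptions made for illustration.

```python
import math
import random


def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy minority class with two features per sample.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_oversample(minority, n_new=6)
```

Because every synthetic point is a convex combination of two real minority samples, it always stays inside the region already occupied by the minority class.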
Due to advancements in information technologies, a massive quantity of data is being produced by social media, smartphones, and sensor devices. The investigation of data streams using machine learning (ML) approaches to address regression, prediction, and classification problems has received considerable interest. At the same time, the detection of anomalies or outliers and feature selection (FS) become important. This study develops an outlier detection with feature selection technique for streaming data classification, named the ODFST-SDC technique. Initially, streaming data is pre-processed in two ways, namely categorical encoding and null value removal. In addition, the Local Correlation Integral (LOCI) is used for the detection and removal of outliers. Besides, a red deer algorithm (RDA)-based FS approach is employed to derive an optimal subset of features. Finally, a kernel extreme learning machine (KELM) classifier is used for streaming data classification. The design of LOCI-based outlier detection and RDA-based FS constitutes the novelty of the work. To assess the classification outcomes of the ODFST-SDC technique, a series of simulations was performed using three benchmark datasets. The experimental results showed promising outcomes for the ODFST-SDC technique over recent approaches.
Big data streams have become ubiquitous in recent years, thanks to the rapid generation of massive volumes of data by different applications. It is challenging to apply existing data mining tools and techniques directly to these big data streams. At the same time, streaming data from several applications raises two major problems: class imbalance and concept drift. This paper presents a new Multi-Objective Metaheuristic Optimization-based Big Data Analytics with Concept Drift Detection (MOMBD-CDD) method for high-dimensional streaming data. The presented MOMBD-CDD model has three operational stages: pre-processing, CDD, and classification. The MOMBD-CDD model overcomes the class imbalance problem with the Synthetic Minority Over-sampling Technique (SMOTE). The Glowworm Swarm Optimization (GSO) algorithm is employed to determine the oversampling rates and neighboring point values of SMOTE. Besides, the Statistical Test of Equal Proportions (STEPD), a CDD technique, is utilized. Finally, a Bidirectional Long Short-Term Memory (Bi-LSTM) model is applied for classification. To improve classification performance and compute the optimum parameters for the Bi-LSTM model, a GSO-based hyperparameter tuning process is carried out. The performance of the presented model was evaluated using high-dimensional benchmark streaming datasets, namely the intrusion detection (NSL KDDCup) dataset and the ECUE spam dataset. An extensive experimental validation confirmed the effectiveness of the MOMBD-CDD model. The proposed model attained high accuracy of 97.45% and 94.23% on the KDDCup99 and ECUE Spam datasets, respectively.
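The STEPD drift check named in the abstract above can be sketched as a test of equal proportions between a classifier's accuracy on older examples and on a recent window. The statistic below follows the form usually given for STEPD, with a continuity correction and a pooled proportion. The function names, the one-sided p-value, and the 0.003 and 0.05 significance levels are assumptions for illustration, not values taken from this paper.

```python
import math


def stepd_p_value(correct_old, n_old, correct_recent, n_recent):
    """One-sided p-value of the equal-proportions test with continuity
    correction, comparing older accuracy against recent accuracy."""
    p_hat = (correct_old + correct_recent) / (n_old + n_recent)
    if p_hat in (0.0, 1.0):
        return 1.0  # pooled proportion has no variance; nothing to test
    inv = 1.0 / n_old + 1.0 / n_recent
    diff = abs(correct_old / n_old - correct_recent / n_recent)
    stat = (diff - 0.5 * inv) / math.sqrt(p_hat * (1.0 - p_hat) * inv)
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(stat / math.sqrt(2.0))


def stepd_status(correct_old, n_old, correct_recent, n_recent,
                 alpha_drift=0.003, alpha_warning=0.05):
    """Return 'drift', 'warning', or 'stable'. Only a drop in recent
    accuracy counts; the alpha levels are assumed defaults."""
    if correct_recent / n_recent >= correct_old / n_old:
        return "stable"
    p = stepd_p_value(correct_old, n_old, correct_recent, n_recent)
    if p < alpha_drift:
        return "drift"
    if p < alpha_warning:
        return "warning"
    return "stable"


# Hypothetical counts: 200 older predictions and a 30-example recent window.
after_change = stepd_status(190, 200, 15, 30)
no_change = stepd_status(190, 200, 28, 30)
```

In a streaming loop the two counters would be updated after every prediction, and a "drift" status would trigger retraining of the classifier.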
Data stream clustering is integral to contemporary big data applications. However, processing the ongoing influx of data streams efficiently and accurately remains a primary challenge in current research. This paper aims to improve the efficiency and precision of data stream clustering. Using the TEDA (Typicality and Eccentricity Data Analysis) algorithm as a foundation, we integrate a nearest neighbor search algorithm to enhance both the efficiency and the accuracy of the algorithm. The original TEDA algorithm, grounded in the concept of "Typicality and Eccentricity Data Analytics", is an evolving and recursive method that requires no prior knowledge. While the algorithm autonomously creates and merges clusters as new data arrives, its efficiency is significantly hindered by the need to traverse all existing clusters whenever further data arrives. This work presents the NS-TEDA (Neighbor Search Based Typicality and Eccentricity Data Analysis) algorithm, which incorporates a KD-Tree (K-Dimensional Tree) algorithm integrated with the Scapegoat Tree. This ensures that, upon arrival, new data points interact only with clusters in very close proximity. It significantly enhances efficiency while preventing a single data point from joining too many clusters and, to some extent, mitigating the merging of clusters with high overlap. We apply the NS-TEDA algorithm to several well-known datasets and compare its performance with other data stream clustering algorithms and the original TEDA algorithm. The results demonstrate that the proposed algorithm achieves higher accuracy and that its runtime depends almost linearly on the volume of data, making it more suitable for large-scale data stream analysis.
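The recursive, knowledge-free character of TEDA described in the abstract above can be sketched with the eccentricity update commonly given in the TEDA literature: a running mean and variance, an eccentricity per sample, and a Chebyshev-style threshold. This covers a single data cloud only and omits the KD-Tree and Scapegoat Tree neighbor search that NS-TEDA adds. The class name and the choice m = 3 are assumptions for illustration.

```python
class TedaDetector:
    """Recursive eccentricity for one data cloud (illustrative sketch)."""

    def __init__(self, m=3.0):
        self.m = m        # sensitivity; m = 3 mimics a 3-sigma rule (assumed)
        self.k = 0        # number of samples seen so far
        self.mean = None  # running mean vector
        self.var = 0.0    # running scalar variance

    def update(self, x):
        """Absorb sample x and return True if it looks eccentric."""
        self.k += 1
        k = self.k
        if k == 1:
            self.mean = list(x)
            return False  # eccentricity is undefined for a single sample
        self.mean = [((k - 1) / k) * mu + xi / k
                     for mu, xi in zip(self.mean, x)]
        dist2 = sum((xi - mu) ** 2 for xi, mu in zip(x, self.mean))
        self.var = ((k - 1) / k) * self.var + dist2 / (k - 1)
        if self.var == 0.0:
            return False  # all samples identical so far
        eccentricity = 1.0 / k + dist2 / (k * self.var)
        # Normalised eccentricity compared with the (m^2 + 1) / (2k) bound.
        return eccentricity / 2.0 > (self.m ** 2 + 1.0) / (2.0 * k)


# Hypothetical one-dimensional stream with a single jump at the end.
stream = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.1, 9.9, 10.0, 10.2, 50.0]
detector = TedaDetector()
flags = [detector.update((value,)) for value in stream]
```

Each update touches only the running mean and variance, so the cost per sample is constant for one cloud. The traversal cost the abstract mentions arises when every new point must be compared against all existing clusters.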
Clustering high-dimensional data is challenging because increasing dimensionality increases the distance between data points, resulting in sparse regions that degrade clustering performance. Subspace clustering is a common approach for processing high-dimensional data by finding the relevant features for each cluster in the data space. Subspace clustering methods for data streams extend traditional clustering to account for the constraints imposed by data streams. Data streams are not only high-dimensional but also unbounded and evolving. This necessitates subspace clustering algorithms that can handle high dimensionality and adapt to the unique characteristics of data streams. Although many articles have reviewed the literature on data stream clustering, there is currently no specific review of subspace clustering algorithms for high-dimensional data streams. Therefore, this article systematically reviews the existing literature on subspace clustering of data streams in high-dimensional streaming environments. The review follows a systematic methodological approach and includes 18 articles in the final analysis. The analysis focused on two research questions, one on the general clustering process and one on dealing with the unbounded and evolving characteristics of data streams. The main findings relate to six elements: clustering process, cluster search, subspace search, synopsis structure, cluster maintenance, and evaluation measures. Most algorithms use a two-phase clustering approach consisting of an initialization stage, a refinement stage, a cluster maintenance stage, and a final clustering stage. The density-based top-down subspace clustering approach is more widely used than the others because it can distinguish true clusters from outliers using projected microclusters. Most algorithms implicitly adapt to the evolving nature of the data stream by using a time fading function that is sensitive to outliers. Future work can focus on the clustering framework, parameter optimization, subspace search techniques, memory-efficient synopsis structures, explicit cluster change detection, and intrinsic performance metrics. This article can serve as a guide for researchers interested in high-dimensional subspace clustering methods for data streams.
Handling sentiment drifts in real-time Twitter data streams is a challenging task in sentiment classification, because the sentiments of Twitter users change over time. The growing volume of tweets with sentiment drifts has created a need for an adaptive approach that detects and handles this drift in real time. This work proposes an adaptive learning algorithm-based framework, Twitter Sentiment Drift Analysis-Bidirectional Encoder Representations from Transformers (TSDA-BERT), which introduces a sentiment drift measure to detect drifts and a domain impact score to adaptively retrain the classification model with domain-relevant data in real time. The framework also works on static data by converting them to data streams using the Kafka tool. Experiments conducted on real-time and simulated tweets on sports, health care, and financial topics show that the proposed system is able to detect sentiment drifts and maintain the performance of the classification model, with accuracies of 91%, 87%, and 90%, respectively. Though results are provided for only a few topics as a proof of concept, the framework can be applied to detect sentiment drifts and perform sentiment classification on real-time data streams of any topic.
Edge-computing-enabled smart greenhouses are a representative application of Internet of Things (IoT) technology; they monitor environmental information in real time and use it to support intelligent decision-making. In this process, anomaly detection for wireless sensor data plays an important role. However, traditional anomaly detection algorithms, originally designed for static data, do not properly consider the inherent characteristics of the data streams produced by wireless sensors, such as infiniteness, correlations, and concept drift. This poses a considerable challenge to stream-based anomaly detection and leads to low detection accuracy and efficiency. First, the data stream is usually generated quickly, which means that it is infinite and enormous. Hence, any traditional offline anomaly detection algorithm that attempts to store the whole dataset or to scan it multiple times will run out of memory. Second, there are correlations among different data streams, and traditional algorithms hardly consider them. Third, the underlying data generation process or distribution may change over time. Thus, traditional anomaly detection algorithms that do not update their model will lose effectiveness. Considering these issues, a novel method called DLSHiForest, based on Locality-Sensitive Hashing and the time window technique, is proposed to solve these problems while achieving accurate and efficient detection. Comprehensive experiments are executed on a real-world agricultural greenhouse dataset to demonstrate the feasibility of our approach. Experimental results show that our proposal is practical for addressing the challenges of traditional anomaly detection while ensuring accuracy and efficiency.
Automatic Identification System (AIS) data stream analysis is based on AIS data describing the behaviour of different vessels, including the vessels' routes. When the AIS data contain outliers or noise, or are incomplete, the analysis of a vessel's behaviour is impossible or limited. When the data contain outliers, it is not possible to automatically assign the AIS data to a particular vessel. In this paper, a clustering method is proposed to support AIS data analysis, to qualify noise and outliers with respect to their suitability, and finally to aid the reconstruction of the vessel's trajectory. Clustering results have been obtained using selected algorithms, including k-means, k-medoids, and fuzzy c-means. Based on the clustering results, it is possible to decide on the qualification of data with outliers and on their usefulness in the reconstruction of the vessel trajectory. The main aim of this paper is to answer how different distance measures used during clustering influence AIS data clustering quality. The core question is whether or not they have an impact on the reconstruction of vessel trajectories when the data are damaged. The research question in the computational experiments was whether or not the distance measure influences AIS data clustering quality. The computational experiments have been carried out using original AIS data. In general, the experiments and results confirm the usefulness of cluster-based analysis when the data include outliers that are derived from the natural environment. It is also possible to monitor and analyse AIS data using clustering when the data include outliers. The computational experiment results confirm that k-means with Euclidean distance has the best performance.
Data stream classification currently plays a key role in big data analytics due to the enormous growth of such data. Most existing classification methods use ensemble learning, which is trustworthy, but these methods are not effective at learning from imbalanced big data, and they also assume that all data are pre-classified. Another weakness of current methods is that evaluation takes a long time when the target data stream contains a high number of features. The main objective of this research is to develop a new method for incremental learning based on the proposed ant lion fuzzy-generative adversarial network model. The proposed model is implemented on the Spark architecture. For each data stream, the class output is computed at slave nodes by training a generative adversarial network with the back-propagation error based on fuzzy bound computation. This method overcomes the limitations of existing methods, as it can classify data streams that are partially or completely unlabeled while providing high scalability and efficiency. The results show that the proposed model outperforms state-of-the-art methods in terms of accuracy (0.861), precision (0.9328), and minimal MSE (0.0416).
Microservices have become popular in enterprises because of their excellent scalability and timely update capabilities. However, while fine-grained modularity and service orientation decrease the complexity of system development, they greatly increase the complexity of system operation and maintenance. Multiple types of system failures occur frequently, and it is hard to detect and diagnose failures in time. Furthermore, microservices are updated frequently. Existing anomaly detection models depend on offline training and cannot adapt to the frequent updates of microservices. This paper proposes an anomaly detection approach for microservice systems with multi-source data streams. The approach realizes online model construction and online anomaly detection, and is capable of self-updating and self-adapting. Experimental results show that the approach can correctly identify 78.85% of faults of different types.
The capacity to analyze massive data has become increasingly necessary, given the high volume of data generated daily by different sources. The data sources are varied and can generate a huge amount of data, which can be processed in batch or stream settings. The stream setting corresponds to the treatment of a continuous sequence of data that arrives in a real-time flow and needs to be processed in real time. The models, tools, methods, and algorithms for generating intelligence from data streams culminate in the approaches of Data Stream Mining and Data Stream Learning. The activities of such approaches can be organized and structured according to engineering principles, giving rise to the principles of Analytical Engineering, or more specifically, Analytical Engineering for Data Stream (AEDS). Thus, this article presents the AEDS conceptual framework, composed of four pillars (Data, Model, Tool, People) and three processes (Acquisition, Retention, Review). The four pillars are defined from the main components of the data stream setting, and the three processes are determined by the need to operationalize the activities of an Analytical Organization (AO) in its use of the four AEDS pillars. The AEDS framework supports the projects carried out in an AO, that is, its Analytical Projects (AP), and thereby the delivery of results, or Analytical Deliverables (AD), by the Analytical Teams (AT), in order to provide intelligence from stream data.
Textual data streams have been extensively used in practical applications where consumers of online products express their views about those products. Due to changes in data distribution, commonly referred to as concept drift, mining such a data stream is a challenging problem for researchers. The majority of existing drift detection techniques are based on classification errors, which have higher probabilities of false-positive or missed detections. To improve classification accuracy, there is a need to develop more intuitive detection techniques that can identify a great number of drifts in the data streams. This paper presents an adaptive unsupervised learning technique, an ensemble classifier based on drift detection for opinion mining and sentiment classification. To improve classification performance, this approach uses four different dissimilarity measures to determine the degree of concept drift in the data stream. Whenever a drift is detected, the proposed method builds a new classifier and adds it to the ensemble. Before the new classifier is added, the total number of classifiers in the ensemble is checked; if the limit is exceeded, the classifier with the least weight is removed from the ensemble. To this end, a weighting mechanism is used to calculate the weight of each classifier, which decides the contribution of each classifier to the final classification results. Several experiments were conducted on real-world datasets, and the results were evaluated on the false positive rate, miss detection rate, and accuracy measures. The proposed method is also compared with state-of-the-art methods, including DDM, EDDM, and PageHinkley with support vector machine (SVM) and Naive Bayes classifiers, which are frequently used in concept drift detection studies. In all cases, the results show the efficiency of our proposed method.
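The ensemble maintenance rule in the abstract above (add a classifier when drift is detected, evict the lowest-weight member once the size limit is reached, and let weights decide each member's share of the final vote) can be sketched as follows. The class name, the size limit of 3, the stand-in classifiers, and the fixed weights are all hypothetical; the paper's own weighting mechanism and dissimilarity measures are not reproduced here.

```python
from collections import defaultdict


class WeightedEnsemble:
    """Bounded ensemble with weighted voting (illustrative sketch)."""

    def __init__(self, max_size=3):
        self.max_size = max_size
        self.members = []  # list of (classifier, weight) pairs

    def add(self, classifier, weight):
        """Add a member, first evicting the lowest-weight one if full."""
        if len(self.members) >= self.max_size:
            weakest = min(range(len(self.members)),
                          key=lambda i: self.members[i][1])
            del self.members[weakest]
        self.members.append((classifier, weight))

    def predict(self, x):
        """Return the label with the largest total weight behind it."""
        votes = defaultdict(float)
        for classifier, weight in self.members:
            votes[classifier(x)] += weight
        return max(votes, key=votes.get)


def always_neg(text):
    return "neg"  # stand-in for a classifier trained on an older concept


def always_pos(text):
    return "pos"  # stand-in for a classifier trained after a drift


ensemble = WeightedEnsemble(max_size=3)
ensemble.add(always_neg, 0.2)
ensemble.add(always_pos, 0.5)
ensemble.add(always_neg, 0.9)
# A detected drift would trigger this call; the 0.2 member is evicted.
ensemble.add(always_pos, 0.6)
label = ensemble.predict("some review text")
```

After the eviction the "pos" members carry a total weight of 1.1 against 0.9 for "neg", so the newer concept wins the vote.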
Every application in a smart city environment, such as the smart grid, health monitoring, security, and surveillance, generates non-stationary data streams. Because of this nature, the statistical properties of the data change over time, leading to class imbalance and concept drift issues. Both issues degrade model performance. Most current work has focused on developing an ensemble strategy that trains a new classifier on the latest data to resolve the issue. These techniques suffer while training the new classifier if the data is imbalanced. Also, the class imbalance ratio may change greatly from one input stream to another, making the problem more complex. Existing solutions for the combined issue of class imbalance and concept drift lack an understanding of how one problem correlates with the other. This work studies the association between concept drift and class imbalance ratio and then demonstrates how changes in class imbalance ratio along with concept drift affect the classifier's performance. We analyzed the effect of both issues on the minority and majority classes individually. To do this, we conducted experiments on benchmark datasets using state-of-the-art classifiers specially designed for data stream classification. Precision, recall, F1 score, and geometric mean were used to measure performance. Our findings show that when class imbalance and concept drift occur together, performance can decrease by up to 15%. Our results also show that an increase in the imbalance ratio can cause a 10% to 15% decrease in the precision scores of both minority and majority classes. The study findings may help in designing intelligent and adaptive solutions that can cope with the challenges of non-stationary data streams, such as concept drift and class imbalance.
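The four measures named in the abstract above can be computed from a binary confusion matrix as in the sketch below. The geometric mean is assumed here to be the square root of sensitivity times specificity, a common choice for imbalanced data; the abstract does not state which definition the authors used. The function name and the example counts are hypothetical.

```python
import math


def imbalance_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and G-mean for the positive (minority) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0       # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    g_mean = math.sqrt(recall * specificity)           # assumed definition
    return {"precision": precision, "recall": recall,
            "f1": f1, "g_mean": g_mean}


# Hypothetical counts: 50 minority and 950 majority examples.
scores = imbalance_metrics(tp=40, fp=10, fn=10, tn=940)
```

Unlike plain accuracy, the G-mean falls to zero when either class is entirely misclassified, which is why it is often reported alongside precision and recall on imbalanced streams.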
With the enhancement of data collection capabilities, massive streaming data have accumulated in numerous application scenarios. Specifically, classifying data streams from mobile sensors can be formalized as a multi-task multi-view learning problem, in which a specific task comprises multiple views with shared features collected from multiple sensors. Existing incremental learning methods are often single-task and single-view, and so cannot learn shared representations between related tasks and views. An adaptive multi-task multi-view incremental learning framework for data stream classification, called MTMVIS, is proposed to address these challenges using the idea of multi-task multi-view learning. Specifically, the attention mechanism is first used to align sensor data from different views. In addition, MTMVIS uses adaptive Fisher regularization from the perspective of multi-task multi-view learning to overcome catastrophic forgetting in incremental learning. Experiments on two different datasets show that the proposed framework outperforms state-of-the-art methods and other baselines.
The expanding amount of information created by Internet of Things (IoT) devices places a strain on cloud computing, which is often used for data analysis and storage. This paper investigates a different approach based on edge cloud applications, in which data is filtered and processed before being delivered to a backup cloud environment. The paper proposes designing and implementing a low-cost, low-power cluster of Single Board Computers (SBC) for this purpose, reducing the amount of data that must be transmitted elsewhere by using Big Data ideas and technology. An Apache Hadoop and Spark cluster, used to run a test application, was containerized and deployed on a Raspberry Pi cluster with Docker. To obtain system data and analyze the setup's performance, a Prometheus-based monitoring and alerting stack from the cloud-based market is employed. The paper assesses the system's complexity and demonstrates how containerization can improve fault tolerance and ease of maintenance, allowing the proposed solution to be used in industry. An evaluation of overall performance is presented to highlight the capabilities and limitations of the proposed architecture, taking into consideration the solution's resource use with respect to device restrictions.
Most stream data classification algorithms apply a supervised learning strategy, which requires massive labeled data. Such approaches are impractical, since labeled data are usually hard to obtain in reality. In this paper, we build a clustering feature decision tree model, CFDT, from data streams containing both unlabeled examples and a small number of labeled examples. CFDT applies a micro-clustering algorithm that scans the data only once to provide statistical summaries of the data for incremental decision tree induction. Micro-clusters also serve as classifiers in tree leaves to improve classification accuracy and reinforce the any-time property. Our experiments on synthetic and real-world datasets show that CFDT is highly scalable for data streams while achieving high classification accuracy at high speed.
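The one-pass statistical summary mentioned in the abstract above can be illustrated with a clustering feature in the style of BIRCH: a count, a linear sum, and a sum of squares that are updated per example and never require revisiting the data. Whether CFDT uses exactly this triple is an assumption, and the class and method names are hypothetical.

```python
import math


class MicroCluster:
    """Clustering feature (N, LS, SS) maintained in a single pass."""

    def __init__(self, dim):
        self.n = 0              # N: number of absorbed points
        self.ls = [0.0] * dim   # LS: per-dimension linear sum
        self.ss = 0.0           # SS: sum of squared norms

    def add(self, x):
        """Absorb one point using O(dim) work and no stored history."""
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        """Root mean squared distance of the points from the centroid."""
        center = self.centroid()
        variance = self.ss / self.n - sum(v * v for v in center)
        return math.sqrt(max(variance, 0.0))  # clamp rounding noise


cluster = MicroCluster(dim=2)
for point in [(0.0, 0.0), (2.0, 0.0)]:
    cluster.add(point)
center = cluster.centroid()
spread = cluster.radius()
```

Because the three components are additive, two micro-clusters can be merged by summing their fields, which suits incremental tree induction over a stream.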
One recent area of interest in computer science is data stream management and processing. By "data stream", we refer to continuous and rapidly generated packages of data. Specific features of data streams are immense volume, high production rate, limited data processing time, and data concept drift; these features differentiate the data stream from standard types of data. One issue for data streams is the classification of input data. A novel ensemble classifier is proposed in this paper. The classifier uses base classifiers with two weighting functions under different data input conditions. In addition, a new method is used to determine drift, which emphasizes the precision of the algorithm. Another characteristic of the proposed method is the removal of different numbers of base classifiers based on their quality. Applying a weighting mechanism to the base classifiers at the decision-making stage is another advantage of the algorithm. This facilitates adaptability when drifts take place, which leads to classifiers with higher efficiency. Furthermore, the proposed method is tested on a set of standard data, and the results confirm higher accuracy compared to available ensemble classifiers and single classifiers. In addition, in some cases the proposed classifier is faster and needs less storage space.
Online anomaly detection for stream data has been explored recently, where the detector is expected to make an accurate and timely judgment on each incoming observation. However, due to the inherently complex characteristics of stream data, such as quick generation, tremendous volume, and a dynamically evolving distribution, developing an effective online anomaly detection method is a challenge. The main objective of this paper is to propose an adaptive online anomaly detection method for stream data. This is achieved by combining the isolation principle with online ensemble learning, which is then optimized by a statistical histogram. Three main algorithms are developed: an online detector building algorithm, an anomaly detecting algorithm, and an adaptive detector updating algorithm. To evaluate the proposed method, four massive datasets from the UCI machine learning repository, recorded from real events, were adopted. Extensive simulations based on these datasets show that our method is effective and robust across different scenarios.
The prevalence of missing values in data streams collected in real environments makes them impossible to ignore in the privacy preservation of data streams. However, most privacy preservation methods are developed without considering missing values. A few studies allow them to participate in data anonymization but introduce considerable extra information loss. To balance the utility and privacy preservation of incomplete data streams, we present a utility-enhanced approach for Incomplete Data strEam Anonymization (IDEA). In this approach, a sliding-window-based processing framework is introduced to anonymize data streams continuously, in which each tuple can be output with clustering or anonymized clusters. We consider the dimensions of attribute and tuple as the similarity measurement, which enables clustering between incomplete records and complete records and generates the cluster with minimal information loss. To avoid missing value pollution, we propose a generalization method based on maybe match for generalizing incomplete data. Experiments conducted on real datasets show that the proposed approach can efficiently anonymize incomplete data streams while effectively preserving utility.
Pushed by the Internet of Things (IoT) paradigm, modern sensor networks monitor a wide range of phenomena in areas such as environmental monitoring, health care, industrial processes, and smart cities. These networks provide a continuous pulse of the almost infinite activities happening in physical space and are thus key enablers for a Digital Earth Nervous System. Nevertheless, the rapid processing of these sensor data streams continues to challenge traditional data-handling solutions, and new approaches are needed. We propose a generic answer to this challenge, which has the potential to support any form of distributed real-time analysis. This neutral methodology follows a brokering approach to work with different kinds of data sources and uses web-based standards to achieve interoperability. As a proof of concept, we implemented the methodology to detect anomalies in real time and applied it to the area of environmental monitoring. The developed system is capable of detecting anomalies, generating notifications, and displaying the recent situation to the user.
Abstract: Recently, anomaly detection (AD) in streaming data has gained significant attention among research communities due to its applicability in finance, business, healthcare, education, etc. Recent developments in deep learning (DL) models have proven helpful in the detection and classification of anomalies. This article designs an oversampling with optimal deep learning-based streaming data classification (OS-ODLSDC) model. The aim of the OS-ODLSDC model is to recognize and classify anomalies in streaming data. The proposed OS-ODLSDC model first applies a preprocessing step. Since streaming data is unbalanced, the support vector machine (SVM)-Synthetic Minority Over-sampling Technique (SVM-SMOTE) is applied for oversampling. In addition, the OS-ODLSDC model employs bidirectional long short-term memory (BiLSTM) for AD and classification. Finally, the root mean square propagation (RMSProp) optimizer is applied for hyperparameter tuning of the BiLSTM model. To validate the performance of the OS-ODLSDC model, a wide-ranging experimental analysis is performed using three benchmark datasets: CICIDS 2018, KDD-Cup 1999, and NSL-KDD.
Abstract: Due to advancements in information technologies, massive quantities of data are being produced by social media, smartphones, and sensor devices. The investigation of data streams using machine learning (ML) approaches to address regression, prediction, and classification problems has received considerable interest. At the same time, the detection of anomalies or outliers and feature selection (FS) have become important. This study develops an outlier detection with feature selection technique for streaming data classification, named the ODFST-SDC technique. Initially, streaming data is pre-processed in two ways, namely categorical encoding and null value removal. In addition, the Local Correlation Integral (LOCI) is used for the detection and removal of outliers. Besides, a red deer algorithm (RDA) based FS approach is employed to derive an optimal subset of features. Finally, a kernel extreme learning machine (KELM) classifier is used for streaming data classification. The design of LOCI-based outlier detection and RDA-based FS constitutes the novelty of the work. To assess the classification outcomes of the ODFST-SDC technique, a series of simulations was performed using three benchmark datasets. The experimental results show promising outcomes for the ODFST-SDC technique compared with recent approaches.
Abstract: Big data streams have become ubiquitous in recent years, thanks to the rapid generation of massive volumes of data by different applications. It is challenging to apply existing data mining tools and techniques directly to these big data streams. At the same time, streaming data from several applications raises two major problems: class imbalance and concept drift. This paper presents a new Multi-Objective Metaheuristic Optimization-based Big Data Analytics with Concept Drift Detection (MOMBD-CDD) method for high-dimensional streaming data. The MOMBD-CDD model has several operational stages: pre-processing, CDD, and classification. The model overcomes the class imbalance problem with the Synthetic Minority Over-sampling Technique (SMOTE). To determine the oversampling rates and neighboring point values of SMOTE, the Glowworm Swarm Optimization (GSO) algorithm is employed. In addition, the Statistical Test of Equal Proportions (STEPD), a CDD technique, is utilized. Finally, a Bidirectional Long Short-Term Memory (Bi-LSTM) model is applied for classification. To improve classification performance and compute the optimum parameters for the Bi-LSTM model, a GSO-based hyperparameter tuning process is carried out. The performance of the presented model was evaluated using high-dimensional benchmark streaming datasets, namely the intrusion detection (NSL KDDCup) dataset and the ECUE spam dataset. An extensive experimental validation confirmed the effectiveness of the MOMBD-CDD model, which attained accuracies of 97.45% and 94.23% on the KDDCup99 and ECUE Spam datasets, respectively.
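As an illustration of the drift detection step named above, the following is a minimal sketch of a STEPD-style test of equal proportions in the form it is commonly described: the classifier's accuracy over a recent window is compared with its accuracy over the older observations. The function name, the one-sided tail, and the 0.003 drift level are assumptions for illustration, not details taken from the abstract.

```python
import math


def stepd_p_value(correct_old, n_old, correct_recent, n_recent):
    """One-sided p-value for a test of equal proportions with continuity correction.

    correct_old / n_old: correct predictions and sample count before the recent window.
    correct_recent / n_recent: the same counts inside the recent window.
    (All names here are hypothetical; they do not come from the paper.)
    """
    if n_old <= 0 or n_recent <= 0:
        raise ValueError("both windows must contain at least one sample")
    pooled = (correct_old + correct_recent) / (n_old + n_recent)
    inv_sizes = 1.0 / n_old + 1.0 / n_recent
    variance = pooled * (1.0 - pooled) * inv_sizes
    if variance == 0.0:
        # Every prediction was right (or wrong) in both windows: no evidence of change.
        return 1.0
    gap = abs(correct_old / n_old - correct_recent / n_recent) - 0.5 * inv_sizes
    statistic = gap / math.sqrt(variance)
    # Upper tail of the standard normal distribution.
    return 0.5 * math.erfc(statistic / math.sqrt(2.0))


# Assumed significance level for signalling drift.
DRIFT_LEVEL = 0.003

stable = stepd_p_value(90, 100, 90, 100)      # same accuracy in both windows
degraded = stepd_p_value(950, 1000, 15, 30)   # recent accuracy collapses
print(stable > DRIFT_LEVEL, degraded < DRIFT_LEVEL)
```

A small p-value suggests the recent accuracy differs from the earlier accuracy by more than chance would explain, which a detector can treat as a drift signal.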
Funding: This research was funded by the National Natural Science Foundation of China (Grant No. 72001190), by the Ministry of Education's Humanities and Social Science Project via the China Ministry of Education (Grant No. 20YJC630173), and by Zhejiang A&F University (Grant No. 2022LFR062).
Abstract: Data stream clustering is integral to contemporary big data applications. However, handling the ongoing influx of data streams efficiently and accurately remains a primary challenge in current research. This paper aims to improve the efficiency and precision of data stream clustering. Using the TEDA (Typicality and Eccentricity Data Analysis) algorithm as a foundation, we integrate a nearest neighbor search algorithm to enhance both the efficiency and accuracy of the algorithm. The original TEDA algorithm, grounded in the concept of “Typicality and Eccentricity Data Analytics”, is an evolving and recursive method that requires no prior knowledge. While the algorithm autonomously creates and merges clusters as new data arrives, its efficiency is significantly hindered by the need to traverse all existing clusters upon the arrival of each new data point. This work presents the NS-TEDA (Neighbor Search Based Typicality and Eccentricity Data Analysis) algorithm, which incorporates a KD-Tree (K-Dimensional Tree) algorithm integrated with the Scapegoat Tree. This ensures that, upon arrival, new data points interact solely with clusters in very close proximity. It significantly enhances the efficiency of the algorithm while preventing a single data point from joining too many clusters and, to some extent, mitigating the merging of clusters with high overlap. We apply the NS-TEDA algorithm to several well-known datasets, comparing its performance with other data stream clustering algorithms and the original TEDA algorithm. The results demonstrate that the proposed algorithm achieves higher accuracy and that its runtime exhibits an almost linear dependence on the volume of data, making it more suitable for large-scale data stream analysis.
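To make the "evolving and recursive" character of TEDA more concrete, here is a minimal sketch of the per-cluster statistics in one commonly cited recursive form: a running mean, a running variance, and the eccentricity of each arriving point. It covers a single cluster only and does not reproduce the neighbor search of NS-TEDA. The class name, the value returned for the first sample, and the `m`-sigma threshold are assumptions for illustration.

```python
class TedaStats:
    """Recursive mean, variance, and eccentricity for one cluster (illustrative)."""

    def __init__(self):
        self.k = 0        # number of points seen so far
        self.mean = None  # running mean vector
        self.var = 0.0    # running scalar variance

    def update(self, x):
        """Absorb point x and return its eccentricity under the updated statistics."""
        self.k += 1
        k = self.k
        if k == 1:
            self.mean = list(x)
            self.var = 0.0
            return 1.0  # placeholder: eccentricity is not informative for one point
        self.mean = [((k - 1) / k) * m + xi / k for m, xi in zip(self.mean, x)]
        dist2 = sum((xi - m) ** 2 for xi, m in zip(x, self.mean))
        self.var = ((k - 1) / k) * self.var + dist2 / (k - 1)
        if self.var == 0.0:
            return 1.0 / k  # all points identical so far
        return 1.0 / k + dist2 / (k * self.var)


def is_outlier(eccentricity, k, m=3.0):
    """Chebyshev-style check: normalized eccentricity above (m^2 + 1) / (2k)."""
    return eccentricity / 2.0 > (m * m + 1.0) / (2.0 * k)


stats = TedaStats()
for point in ([0.0], [2.0], [1.0]):
    ecc = stats.update(point)
    print(stats.k, ecc, is_outlier(ecc, stats.k))
```

Because each update touches only the stored mean and variance, no past points need to be kept, which is the property that makes this family of methods suitable for streams.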
Abstract: Clustering high-dimensional data is challenging because data dimensionality increases the distance between data points, resulting in sparse regions that degrade clustering performance. Subspace clustering is a common approach for processing high-dimensional data by finding relevant features for each cluster in the data space. Subspace clustering methods extend traditional clustering to account for the constraints imposed by data streams. Data streams are not only high-dimensional, but also unbounded and evolving. This necessitates the development of subspace clustering algorithms that can handle high dimensionality and adapt to the unique characteristics of data streams. Although many articles have contributed to the literature review on data stream clustering, there is currently no specific review on subspace clustering algorithms in high-dimensional data streams. Therefore, this article aims to systematically review the existing literature on subspace clustering of data streams in high-dimensional streaming environments. The review follows a systematic methodological approach and includes 18 articles in the final analysis. The analysis focused on two research questions related to the general clustering process and to dealing with the unbounded and evolving characteristics of data streams. The main findings relate to six elements: clustering process, cluster search, subspace search, synopsis structure, cluster maintenance, and evaluation measures. Most algorithms use a two-phase clustering approach consisting of an initialization stage, a refinement stage, a cluster maintenance stage, and a final clustering stage. The density-based top-down subspace clustering approach is more widely used than the others because it is able to distinguish true clusters and outliers using projected microclusters. Most algorithms implicitly adapt to the evolving nature of the data stream by using a time fading function that is sensitive to outliers. Future work can focus on the clustering framework, parameter optimization, subspace search techniques, memory-efficient synopsis structures, explicit cluster change detection, and intrinsic performance metrics. This article can serve as a guide for researchers interested in high-dimensional subspace clustering methods for data streams.
Abstract: Handling sentiment drifts in real-time Twitter data streams is a challenging task in sentiment classification, because the sentiments of Twitter users change over time. The growing volume of tweets with sentiment drifts has led to the need for an adaptive approach to detect and handle this drift in real time. This work proposes an adaptive learning algorithm-based framework, Twitter Sentiment Drift Analysis-Bidirectional Encoder Representations from Transformers (TSDA-BERT), which introduces a sentiment drift measure to detect drifts and a domain impact score to adaptively retrain the classification model with domain-relevant data in real time. The framework also works on static data by converting them to data streams using the Kafka tool. The experiments conducted on real-time and simulated tweets on sports, health care, and financial topics show that the proposed system is able to detect sentiment drifts and maintain the performance of the classification model, with accuracies of 91%, 87%, and 90%, respectively. Although results are provided only for a few topics, as a proof of concept, this framework can be applied to detect sentiment drifts and perform sentiment classification on real-time data streams of any topic.
Funding: Supported in part by the Fundamental Research Funds for the Central Universities under Grant No. 30919011282.
Abstract: Edge-computing-enabled smart greenhouses are a representative application of Internet of Things (IoT) technology; they monitor environmental information in real time and use that information to support intelligent decision-making. In this process, anomaly detection for wireless sensor data plays an important role. However, traditional anomaly detection algorithms, originally designed for static data, do not properly consider the inherent characteristics of the data stream produced by wireless sensors, such as infiniteness, correlations, and concept drift. These characteristics pose a considerable challenge to anomaly detection on data streams and lead to low detection accuracy and efficiency. First, the data stream is usually generated quickly, which means that it is infinite and enormous. Hence, any traditional off-line anomaly detection algorithm that attempts to store the whole dataset or to scan the dataset multiple times will run out of memory. Second, there exist correlations among different data streams, and traditional algorithms hardly consider these correlations. Third, the underlying data generation process or distribution may change over time. Thus, traditional anomaly detection algorithms with no model update will lose their effectiveness. Considering these issues, a novel method (called DLSHiForest) based on Locality-Sensitive Hashing and the time window technique is proposed to solve these problems while achieving accurate and efficient detection. Comprehensive experiments are executed using a real-world agricultural greenhouse dataset to demonstrate the feasibility of our approach. Experimental results show that our proposal is practical for addressing the challenges of traditional anomaly detection while ensuring accuracy and efficiency.
Abstract: Automatic Identification System (AIS) data stream analysis is based on AIS data describing the behaviour of different vessels, including the vessels' routes. When the AIS data contain outliers or noise, or are incomplete, analysis of vessel behaviour is not possible or is limited. When the data contain outliers, it is not possible to automatically assign the AIS data to a particular vessel. In this paper, a clustering method is proposed to support AIS data analysis, to qualify noise and outliers with respect to their suitability, and finally to aid the reconstruction of a vessel's trajectory. Clustering results have been obtained using selected algorithms, including k-means, k-medoids, and fuzzy c-means. Based on the clustering results, it is possible to decide on the qualification of data with outliers and on their usefulness in reconstructing the vessel trajectory. The main aim of this paper is to determine how different distance measures used during clustering influence AIS data clustering quality. The core question is whether they affect the reconstruction of vessel trajectories when the data are damaged. The computational experiments, which examined whether the distance measure influences AIS data clustering quality, were carried out using original AIS data. In general, the experiments and their results confirm the usefulness of cluster-based analysis when the data include outliers that are derived from the natural environment. It is also possible to monitor and analyse AIS data using clustering when the data include outliers. The computational experiment results confirm that k-means with Euclidean distance has the best performance.
Funding: Taif University Researchers Supporting Project Number (TURSP-2020/126), Taif University, Taif, Saudi Arabia.
Abstract: Data stream classification currently plays a key role in big data analytics due to the enormous growth of such data. Most existing classification methods use ensemble learning, which is trustworthy, but these methods are not effective at learning from imbalanced big data and also assume that all data are pre-classified. Another weakness of current methods is the long evaluation time required when the target data stream contains a large number of features. The main objective of this research is to develop a new method for incremental learning based on the proposed ant lion fuzzy-generative adversarial network model. The proposed model is implemented in the Spark architecture. For each data stream, the class output is computed at slave nodes by training a generative adversarial network with the back-propagation error based on fuzzy bound computation. This method overcomes the limitations of existing methods, as it can classify data streams that are partially or completely unlabeled while providing high scalability and efficiency. The results show that the proposed model outperforms state-of-the-art methods, with an accuracy of 0.861, a precision of 0.9328, and a minimal MSE of 0.0416.
Funding: Supported by ZTE Industry-University-Institute Cooperation Funds under Grant No. HF-CN-202008200001.
Abstract: Microservices have become popular in enterprises because of their excellent scalability and timely update capabilities. However, while fine-grained modularity and service orientation decrease the complexity of system development, they greatly increase the complexity of system operation and maintenance. Multiple types of system failures occur frequently, and it is hard to detect and diagnose failures in time. Furthermore, microservices are updated frequently. Existing anomaly detection models depend on offline training and cannot adapt to the frequent updates of microservices. This paper proposes an anomaly detection approach for microservice systems with multi-source data streams. This approach realizes online model construction and online anomaly detection, and is capable of self-updating and self-adapting. Experimental results show that this approach can correctly identify 78.85% of faults of different types.
Abstract: The capacity to analyze massive data has become increasingly necessary, given the high volume of data generated daily by different sources. The data sources are varied and can generate a huge amount of data, which can be processed in batch or stream settings. The stream setting corresponds to the treatment of a continuous sequence of data that arrives as a real-time flow and needs to be processed in real time. The models, tools, methods, and algorithms for generating intelligence from data streams culminate in the approaches of Data Stream Mining and Data Stream Learning. The activities of such approaches can be organized and structured according to engineering principles, giving rise to Analytical Engineering or, more specifically, Analytical Engineering for Data Stream (AEDS). This article presents the AEDS conceptual framework, composed of four pillars (Data, Model, Tool, People) and three processes (Acquisition, Retention, Review). The four pillars are defined from the main components of the data stream setting, and the three processes from the need to operationalize the activities of an Analytical Organization (AO) in its use of the four AEDS pillars. The AEDS framework supports the projects carried out in an AO, that is, its Analytical Projects (AP), and favors the delivery of results, or Analytical Deliverables (AD), by the Analytical Teams (AT) in order to provide intelligence from stream data.
Funding: The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work through Large Groups (Project under Grant Number RGP.2/49/43).
Abstract: Textual data streams have been extensively used in practical applications where consumers express their views regarding online products. Due to changes in data distribution, commonly referred to as concept drift, mining such data streams is a challenging problem for researchers. The majority of existing drift detection techniques are based on classification errors, which have higher probabilities of false-positive or missed detections. To improve classification accuracy, there is a need to develop more intuitive detection techniques that can identify a great number of drifts in the data streams. This paper presents an adaptive unsupervised learning technique, an ensemble classifier based on drift detection for opinion mining and sentiment classification. To improve classification performance, this approach uses four different dissimilarity measures to determine the degree of concept drift in the data stream. Whenever a drift is detected, the proposed method builds a new classifier and adds it to the ensemble. Before the new classifier is added, the total number of classifiers in the ensemble is checked; if the limit is exceeded, the classifier with the least weight is removed from the ensemble. To this end, a weighting mechanism is used to calculate the weight of each classifier, which determines the contribution of each classifier to the final classification result. Several experiments were conducted on real-world datasets, and the results were evaluated on the false positive rate, miss detection rate, and accuracy measures. The proposed method is also compared with state-of-the-art methods, including DDM, EDDM, and PageHinkley with support vector machine (SVM) and Naive Bayes classifiers, which are frequently used in concept drift detection studies. In all cases, the results show the efficiency of our proposed method.
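The ensemble maintenance rule described above (add a classifier on drift, evict the least-weighted member when the size limit is reached, and combine members by weight) can be sketched as follows. This is an illustrative sketch only: the class and method names are hypothetical, classifiers are represented as plain callables, and the weights are supplied by the caller because the abstract does not specify how they are computed.

```python
from collections import defaultdict


class WeightedEnsemble:
    """Bounded ensemble with least-weight eviction and weighted voting (illustrative)."""

    def __init__(self, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self.members = []  # list of (predict_fn, weight) pairs

    def add(self, predict_fn, weight):
        """Add a classifier, first evicting the weakest member if the ensemble is full."""
        if len(self.members) >= self.max_size:
            weakest = min(range(len(self.members)), key=lambda i: self.members[i][1])
            self.members.pop(weakest)
        self.members.append((predict_fn, weight))

    def predict(self, x):
        """Return the label with the largest total weight across members."""
        if not self.members:
            raise ValueError("ensemble is empty")
        votes = defaultdict(float)
        for predict_fn, weight in self.members:
            votes[predict_fn(x)] += weight
        return max(votes, key=votes.get)


ensemble = WeightedEnsemble(max_size=2)
ensemble.add(lambda text: "positive", 0.9)
ensemble.add(lambda text: "negative", 0.2)
ensemble.add(lambda text: "negative", 0.5)  # evicts the 0.2-weight member
print(ensemble.predict("great product"))
```

Keeping the ensemble bounded limits memory and prediction cost on an unbounded stream, at the price of discarding whichever model currently contributes least.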
Funding: The authors would like to extend their gratitude to Universiti Teknologi PETRONAS (Malaysia) for funding this research through grant number 015LA0-037.
Abstract: Every application in a smart city environment, such as the smart grid, health monitoring, security, and surveillance, generates non-stationary data streams. Because of this, the statistical properties of the data change over time, leading to class imbalance and concept drift issues. Both issues cause model performance degradation. Most current work has focused on developing an ensemble strategy that trains a new classifier on the latest data to resolve the issue. These techniques suffer when training the new classifier if the data is imbalanced. Also, the class imbalance ratio may change greatly from one input stream to another, making the problem more complex. Existing solutions for the combined issue of class imbalance and concept drift lack an understanding of how one problem correlates with the other. This work studies the association between concept drift and class imbalance ratio and then demonstrates how changes in the class imbalance ratio, together with concept drift, affect the classifier's performance. We analyzed the effect of both issues on the minority and majority classes individually. To do this, we conducted experiments on benchmark datasets using state-of-the-art classifiers specifically designed for data stream classification. Precision, recall, F1 score, and geometric mean were used to measure performance. Our findings show that when class imbalance and concept drift occur together, performance can decrease by up to 15%. Our results also show that an increase in the imbalance ratio can cause a 10% to 15% decrease in the precision scores of both minority and majority classes. The study findings may help in designing intelligent and adaptive solutions that can cope with the challenges of non-stationary data streams, such as concept drift and class imbalance.
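For reference, the four measures named above can be computed from binary confusion-matrix counts as in the sketch below. It assumes the definition of geometric mean usual in the class imbalance literature, the square root of recall times specificity; the abstract does not state which definition the study used, and the function name is hypothetical.

```python
import math


def imbalance_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and G-mean for the positive (minority) class.

    tp, fp, fn, tn: true positives, false positives, false negatives, true negatives.
    Ratios with an empty denominator are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return {"precision": precision, "recall": recall, "f1": f1, "g_mean": g_mean}


# 10 minority and 90 majority samples: plain accuracy would be 96%,
# while the per-class measures expose the weaker minority performance.
print(imbalance_metrics(tp=8, fp=2, fn=2, tn=88))
```

Unlike accuracy, the geometric mean drops sharply when either class is poorly recognized, which is why it is a common choice for imbalanced streams.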
Abstract: With the enhancement of data collection capabilities, massive streaming data have been accumulated in numerous application scenarios. Specifically, the problem of classifying data streams from mobile sensors can be formalized as a multi-task multi-view learning problem, in which a specific task comprises multiple views with shared features collected from multiple sensors. Existing incremental learning methods are often single-task and single-view, and cannot learn shared representations between relevant tasks and views. To address these challenges, an adaptive multi-task multi-view incremental learning framework for data stream classification, called MTMVIS, is proposed. Specifically, the attention mechanism is first used to align the sensor data of different views. In addition, MTMVIS uses adaptive Fisher regularization from the perspective of multi-task multi-view learning to overcome catastrophic forgetting in incremental learning. Experiments on two different datasets show that the proposed framework outperforms state-of-the-art methods and other baselines.
Funding: This research project was supported by a grant from the “Research Center of College of Computer and Information Sciences”, Deanship of Scientific Research, King Saud University.
Abstract: The expanding amount of information created by Internet of Things (IoT) devices places a strain on cloud computing, which is often used for data analysis and storage. This paper investigates a different approach based on edge cloud applications, in which data is filtered and processed before being delivered to a backup cloud environment. For this purpose, the paper proposes designing and implementing a low-cost, low-power cluster of Single Board Computers (SBC), which uses Big Data ideas and technology to reduce the amount of data that must be transmitted elsewhere. An Apache Hadoop and Spark cluster used to run a test application was containerized and deployed using a Raspberry Pi cluster and Docker. To obtain system data and analyze the setup's performance, a Prometheus-based stack, a monitoring and alerting solution from the cloud-based market, is employed. The paper assesses the system's complexity and demonstrates how containerization can improve fault tolerance and ease of maintenance, allowing the suggested solution to be used in industry. An evaluation of the overall performance is presented to highlight the capabilities and limitations of the suggested architecture, taking into consideration the solution's resource use with respect to device restrictions.
Funding: Supported by the National Natural Science Foundation of China (No. 60673024) and the "Eleventh Five" Preliminary Research Project of PLA (No. 102060206).
Abstract: Most stream data classification algorithms apply a supervised learning strategy, which requires massive labeled data. Such approaches are impractical since labeled data are usually hard to obtain in reality. In this paper, we build a clustering feature decision tree model, CFDT, from data streams having both unlabeled examples and a small number of labeled examples. CFDT applies a micro-clustering algorithm that scans the data only once to provide statistical summaries of the data for incremental decision tree induction. Micro-clusters also serve as classifiers in tree leaves to improve classification accuracy and reinforce the any-time property. Our experiments on synthetic and real-world datasets show that CFDT is highly scalable for data streams while achieving high classification accuracy at high speed.
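A one-pass statistical summary of the kind a micro-clustering algorithm maintains can be sketched with a BIRCH-style clustering feature: the point count, the per-dimension linear sum, and the sum of squared norms. The abstract does not give CFDT's exact summary structure, so the triple below and the class name are assumptions for illustration.

```python
import math


class ClusteringFeature:
    """Additive micro-cluster summary (N, LS, SS), updated in a single pass."""

    def __init__(self, dim):
        self.n = 0              # number of points absorbed
        self.ls = [0.0] * dim   # linear sum per dimension
        self.ss = 0.0           # sum of squared Euclidean norms

    def add(self, point):
        """Absorb one point without storing it."""
        self.n += 1
        for i, value in enumerate(point):
            self.ls[i] += value
        self.ss += sum(value * value for value in point)

    def merge(self, other):
        """Combine two summaries; every component is additive."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [value / self.n for value in self.ls]

    def radius(self):
        """Root mean squared distance of the absorbed points from the centroid."""
        center_norm2 = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.ss / self.n - center_norm2, 0.0))


cf = ClusteringFeature(dim=2)
for point in [(0.0, 0.0), (2.0, 0.0)]:
    cf.add(point)
print(cf.centroid(), cf.radius())
```

Because the centroid and radius are recoverable from three running totals, each arriving example is read once and then discarded, which matches the single-scan requirement described above.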
Abstract: One recent area of interest in computer science is data stream management and processing. By "data stream", we refer to continuous and rapidly generated packages of data. Specific features of data streams are immense volume, high production rate, limited data processing time, and data concept drift; these features differentiate the data stream from standard types of data. One issue for data streams is the classification of input data. A novel ensemble classifier is proposed in this paper. The classifier uses base classifiers with two weighting functions under different data input conditions. In addition, a new method is used to determine drift, which emphasizes the precision of the algorithm. Another characteristic of the proposed method is the removal of different numbers of base classifiers based on their quality. Applying a weighting mechanism to the base classifiers at the decision-making stage is another advantage of the algorithm. This facilitates adaptability when drifts take place, which leads to classifiers with higher efficiency. Furthermore, the proposed method is tested on a set of standard data, and the results confirm higher accuracy compared to available ensemble classifiers and single classifiers. In addition, in some cases the proposed classifier is faster and needs less storage space.
Funding: This work is supported by the National Key Scientific Instrument and Equipment Development Project (2012YQ15008703), the Open Project of Top Key Discipline of Computer Software and Theory in Zhejiang Provincial (ZC323014100), the National Science Foundation of China (61104089, 61473182), the Science and Technology Commission of Shanghai Municipality (11JC1404000, 14JC1402200), and the Shanghai Rising-Star Program (13QA1401600).
Abstract: Online anomaly detection for stream data has been explored recently, where the detector is expected to make an accurate and timely judgment on each upcoming observation. However, due to the inherently complex characteristics of stream data, such as quick generation, tremendous volume, and a dynamically evolving distribution, developing an effective online anomaly detection method is a challenge. The main objective of this paper is to propose an adaptive online anomaly detection method for stream data. This is achieved by combining the isolation principle with online ensemble learning, which is then optimized by statistical histograms. Three main algorithms are developed: an online detector building algorithm, an anomaly detecting algorithm, and an adaptive detector updating algorithm. To evaluate the proposed method, four massive datasets from the UCI machine learning repository, recorded from real events, were adopted. Extensive simulations based on these datasets show that our method is effective and robust across different scenarios.
Funding: Supported by the National Natural Science Foundation of China (Nos. U19A2081 and 61802270) and the Fundamental Research Funds for the Central Universities (No. 2020SCUNG129).
Abstract: The prevalence of missing values in data streams collected in real environments makes them impossible to ignore in the privacy preservation of data streams. However, most privacy preservation methods are developed without considering missing values. A few studies allow them to participate in data anonymization but introduce considerable extra information loss. To balance the utility and privacy preservation of incomplete data streams, we present a utility-enhanced approach for Incomplete Data strEam Anonymization (IDEA). In this approach, a sliding-window-based processing framework is introduced to anonymize data streams continuously, in which each tuple can be output with clustering or with anonymized clusters. We consider the dimensions of attribute and tuple as the similarity measurement, which enables clustering between incomplete records and complete records and generates the cluster with minimal information loss. To avoid missing value pollution, we propose a generalization method based on "maybe match" for generalizing incomplete data. The experiments conducted on real datasets show that the proposed approach can efficiently anonymize incomplete data streams while effectively preserving utility.
Funding: European Commission and Generalitat Valenciana government [ACIF/2012/112] and [BEFPI/2014/067].
Abstract: Pushed by the Internet of Things (IoT) paradigm, modern sensor networks monitor a wide range of phenomena in areas such as environmental monitoring, health care, industrial processes, and smart cities. These networks provide a continuous pulse of the almost infinite activities that are happening in the physical space and are thus key enablers for a Digital Earth Nervous System. Nevertheless, the rapid processing of these sensor data streams continues to challenge traditional data-handling solutions, and new approaches are needed. We propose a generic answer to this challenge, which has the potential to support any form of distributed real-time analysis. This neutral methodology follows a brokering approach to work with different kinds of data sources and uses web-based standards to achieve interoperability. As a proof of concept, we implemented the methodology to detect anomalies in real time and applied it to the area of environmental monitoring. The developed system is capable of detecting anomalies, generating notifications, and displaying the recent situation to the user.