Data stream clustering is integral to contemporary big data applications. However, addressing the ongoing influx of data streams efficiently and accurately remains a primary challenge in current research. This paper aims to improve the efficiency and precision of data stream clustering. Taking the TEDA (Typicality and Eccentricity Data Analysis) algorithm as a foundation, we introduce improvements by integrating a nearest neighbor search algorithm to enhance both the efficiency and the accuracy of the algorithm. The original TEDA algorithm, grounded in the concept of "typicality and eccentricity data analytics", is an evolving and recursive method that requires no prior knowledge. While the algorithm autonomously creates and merges clusters as new data arrive, its efficiency is significantly hindered by the need to traverse all existing clusters whenever further data arrive. This work presents the NS-TEDA (Neighbor Search Based Typicality and Eccentricity Data Analysis) algorithm, which incorporates a KD-Tree (K-Dimensional Tree) integrated with a Scapegoat Tree. This ensures that a newly arrived data point interacts only with clusters in its close proximity, which significantly enhances efficiency while preventing a single data point from joining too many clusters and, to some extent, mitigating the merging of highly overlapping clusters. We apply NS-TEDA to several well-known datasets and compare its performance with other data stream clustering algorithms and the original TEDA algorithm. The results demonstrate that the proposed algorithm achieves higher accuracy, and its runtime grows almost linearly with the volume of data, making it more suitable for large-scale data stream analysis.
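The sketch below illustrates the two ideas the abstract combines: recursive per-cluster statistics with an eccentricity test in the style of TEDA, and a KD-tree over cluster centers so that a new point is only evaluated against nearby clusters. It is a simplified reading of the abstract rather than the authors' NS-TEDA implementation; the radius, the Chebyshev-style typicality threshold, and all names are illustrative assumptions, and SciPy's cKDTree stands in for the KD-Tree/Scapegoat-Tree combination described in the paper.

```python
# A simplified sketch, not the authors' NS-TEDA implementation.
import numpy as np
from scipy.spatial import cKDTree


class Cluster:
    """Per-cluster recursive statistics in the style of TEDA."""

    def __init__(self, x):
        self.k = 1                                  # number of points absorbed so far
        self.mean = np.asarray(x, float).copy()     # recursive mean
        self.var = 0.0                              # recursive scalar variance

    def update(self, x):
        self.k += 1
        self.mean = (self.k - 1) / self.k * self.mean + x / self.k
        d2 = float(np.sum((x - self.mean) ** 2))
        self.var = (self.k - 1) / self.k * self.var + d2 / (self.k - 1)

    def eccentricity(self, x):
        # TEDA-style eccentricity: 1/k plus the squared distance to the mean,
        # normalised by k times the cluster variance.
        if self.var == 0.0:
            return 1.0
        return 1.0 / self.k + float(np.sum((self.mean - x) ** 2)) / (self.k * self.var)


def process_point(x, clusters, radius=3.0, m=3.0):
    """Test x only against clusters whose centres lie within `radius` (KD-tree query)."""
    x = np.asarray(x, float)
    joined = []
    if clusters:
        tree = cKDTree(np.array([c.mean for c in clusters]))  # rebuilt per point for clarity only
        for i in tree.query_ball_point(x, r=radius):
            c = clusters[i]
            # Chebyshev-style typicality test often used with TEDA;
            # the paper's exact acceptance rule may differ.
            if c.eccentricity(x) <= (m * m + 1.0) / c.k:
                c.update(x)
                joined.append(i)
    if not joined:                       # no nearby typical cluster: start a new one
        clusters.append(Cluster(x))
    return joined


clusters = []
for p in np.random.default_rng(0).normal(size=(100, 2)):
    process_point(p, clusters)
```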
Clustering high-dimensional data is challenging as data dimensionality increases the distance between data points, resulting in sparse regions that degrade clustering performance. Subspace clustering is a common approach for processing high-dimensional data by finding relevant features for each cluster in the data space. Subspace clustering methods extend traditional clustering to account for the constraints imposed by data streams. Data streams are not only high-dimensional, but also unbounded and evolving. This necessitates the development of subspace clustering algorithms that can handle high dimensionality and adapt to the unique characteristics of data streams. Although many articles have contributed to the literature review on data stream clustering, there is currently no specific review on subspace clustering algorithms in high-dimensional data streams. Therefore, this article aims to systematically review the existing literature on subspace clustering of data streams in high-dimensional streaming environments. The review follows a systematic methodological approach and includes 18 articles for the final analysis. The analysis focused on two research questions related to the general clustering process and dealing with the unbounded and evolving characteristics of data streams. The main findings relate to six elements: clustering process, cluster search, subspace search, synopsis structure, cluster maintenance, and evaluation measures. Most algorithms use a two-phase clustering approach consisting of an initialization stage, a refinement stage, a cluster maintenance stage, and a final clustering stage. The density-based top-down subspace clustering approach is more widely used than the others because it is able to distinguish true clusters and outliers using projected microclusters. Most algorithms implicitly adapt to the evolving nature of the data stream by using a time fading function that is sensitive to outliers. Future work can focus on the clustering framework, parameter optimization, subspace search techniques, memory-efficient synopsis structures, explicit cluster change detection, and intrinsic performance metrics. This article can serve as a guide for researchers interested in high-dimensional subspace clustering methods for data streams.
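As a minimal illustration of the time fading function mentioned in the findings, the snippet below shows the exponential decay weight commonly used to age micro-clusters in streaming algorithms; the decay rate is a free parameter, not a value taken from any of the surveyed papers.

```python
# Illustrative exponential time-fading weight; lambda_ is a free parameter.
def fading_weight(t_now, t_last_update, lambda_=0.01):
    """Weight 2^(-lambda * dt): recent updates count fully, stale ones decay."""
    return 2.0 ** (-lambda_ * (t_now - t_last_update))


# A micro-cluster last updated 100 time units ago retains half of its weight
# when lambda_ = 0.01.
print(fading_weight(200, 100))   # 0.5
```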
Handling sentiment drifts in real-time Twitter data streams is a challenging task when performing sentiment classification, because the sentiments of Twitter users change over time. The growing volume of tweets with sentiment drifts has led to the need for devising an adaptive approach to detect and handle this drift in real time. This work proposes an adaptive learning framework, Twitter Sentiment Drift Analysis-Bidirectional Encoder Representations from Transformers (TSDA-BERT), which introduces a sentiment drift measure to detect drifts and a domain impact score to adaptively retrain the classification model with domain-relevant data in real time. The framework also works on static data by converting them to data streams using the Kafka tool. Experiments conducted on real-time and simulated tweets on sports, health care, and financial topics show that the proposed system is able to detect sentiment drifts and maintain the performance of the classification model, with accuracies of 91%, 87%, and 90%, respectively. Although results are provided only for a few topics as a proof of concept, the framework can be applied to detect sentiment drifts and perform sentiment classification on real-time data streams of any topic.
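The abstract does not give the exact form of the sentiment drift measure, so the sketch below uses a deliberately simple stand-in, the absolute change in the positive-tweet proportion between two consecutive windows, purely to illustrate how such a measure could trigger adaptive retraining; the threshold and all names are hypothetical.

```python
# Hypothetical stand-in for the paper's sentiment drift measure.
def drift_score(prev_labels, curr_labels):
    """Absolute change in the positive-tweet proportion between two windows."""
    p_prev = sum(l == "pos" for l in prev_labels) / max(len(prev_labels), 1)
    p_curr = sum(l == "pos" for l in curr_labels) / max(len(curr_labels), 1)
    return abs(p_curr - p_prev)


window_a = ["pos", "pos", "neg", "pos"]
window_b = ["neg", "neg", "neg", "pos"]
if drift_score(window_a, window_b) > 0.3:     # threshold is illustrative
    print("sentiment drift detected -> retrain with domain-relevant data")
```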
A new algorithm for clustering multiple data streams is proposed. The algorithm can effectively cluster data streams that show similar behavior with some unknown time delays. It uses the autoregressive (AR) modeling technique to measure correlations between data streams and exploits the estimated frequency spectra to extract the essential features of the streams. Each stream is represented as a sum of spectral components, and the correlation is measured component-wise. Each spectral component is described by four parameters, namely amplitude, phase, damping rate, and frequency. The ε-lag correlation between two spectral components is calculated, and the algorithm uses this information as the similarity measure when clustering data streams. Based on a sliding window model, the algorithm can continuously report the most recent clustering results and adjust the number of clusters. Experiments on real and synthetic streams show that the proposed clustering method achieves higher speed and clustering quality than other similar methods.
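As a hedged illustration of the ε-lag correlation idea, the sketch below computes the best Pearson correlation achievable by shifting one stream against the other by at most ε positions; the paper measures this per spectral component of an AR model, whereas the sketch applies it directly to raw windows for brevity.

```python
# Illustrative epsilon-lag correlation between two raw stream windows.
import numpy as np


def eps_lag_correlation(x, y, eps=5):
    """Best Pearson correlation over all shifts of at most `eps` positions."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    best = -1.0
    for lag in range(-eps, eps + 1):
        if lag >= 0:
            a, b = x[lag:], y[:len(y) - lag]
        else:
            a, b = x[:len(x) + lag], y[-lag:]
        n = min(len(a), len(b))
        if n > 1 and np.std(a[:n]) > 0 and np.std(b[:n]) > 0:
            best = max(best, float(np.corrcoef(a[:n], b[:n])[0, 1]))
    return best


t = np.linspace(0, 4 * np.pi, 200)
print(eps_lag_correlation(np.sin(t), np.sin(t - 0.3), eps=10))   # close to 1
```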
In order to avoid the redundant and inconsistent information in distributed data streams, a sampling method based on min-wise hash functions is designed and the practical semantics of the union of distributed data streams is defined. First, for each family of min-wise hash functions, the data with the minimum hash value are selected as local samples and the biased effect caused by frequent updates in a single data stream is filtered out. Secondly, for the same hash function, the sample with the minimum hash value is selected as the global sample and the local samples are combined at the center node to filter out the biased effect of duplicated updates. Finally, based on the obtained uniform samples, several aggregations on the defined semantics of the union of data streams are precisely estimated. The results of comparison tests on synthetic and real-life data streams demonstrate the effectiveness of this method.
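A minimal sketch of the min-wise hashing idea follows: each site keeps, for every hash function, the item with the smallest hash value, and the center node merges the local samples by keeping the global minimum per function, which yields a near-uniform sample over the distinct items of the union. The hash family and sample size are illustrative, not the paper's parameters.

```python
# Min-wise-hash sampling sketch: local samples per site, merged at the centre.
import random

P = 2 ** 61 - 1                                  # large prime modulus
random.seed(7)
HASHES = [(random.randrange(1, P), random.randrange(P)) for _ in range(64)]


def h(i, x):
    a, b = HASHES[i]
    return (a * hash(x) + b) % P


def local_sample(stream):
    """Keep, per hash function, the item with the minimum hash value."""
    sample = [None] * len(HASHES)
    for item in stream:                          # duplicate updates cannot bias this
        for i in range(len(HASHES)):
            if sample[i] is None or h(i, item) < h(i, sample[i]):
                sample[i] = item
    return sample


def merge(samples):
    """Centre node: keep the global minimum per hash function."""
    return [min((s[i] for s in samples if s[i] is not None), key=lambda v: h(i, v))
            for i in range(len(HASHES))]


site1 = local_sample(["a", "b", "b", "b", "c"])
site2 = local_sample(["c", "d", "d", "e"])
print(merge([site1, site2]))                     # ~uniform over the distinct items {a..e}
```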
In order to improve the precision of super point detection and to control measurement resource consumption, this paper proposes a super point detection method based on sampling and data streaming algorithms (SDSD), and proves that only sources or destinations with a large number of flows can be sampled probabilistically using the SDSD algorithm. The SDSD algorithm uses both an IP table and a flow Bloom filter (BF) data structure to maintain the IP and flow information. The IP table is used to judge whether an IP address has been recorded. If the IP exists, then all its subsequent flows will be recorded into the flow BF; otherwise, the IP flow is sampled. This paper also analyzes the accuracy and memory requirements of the SDSD algorithm and tests them using the CERNET trace. The theoretical analysis and experimental tests demonstrate that the relative errors of the super points estimated by the SDSD algorithm are mostly less than 5%, whereas the results of other algorithms are about 10%. Because of the BF structure, the SDSD algorithm is also better than previous algorithms in terms of memory consumption.
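The interplay between the IP table and the flow Bloom filter can be sketched roughly as follows; this is not the paper's exact SDSD procedure, and the filter size, hash count, sampling rate, and threshold are illustrative choices only.

```python
# Rough sketch of the data-structure interplay described in the abstract.
import hashlib
import random
from collections import defaultdict


class BloomFilter:
    def __init__(self, m=1 << 20, k=4):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)

    def _positions(self, key):
        for i in range(self.k):
            d = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(d, "big") % self.m

    def add(self, key):
        new = False
        for p in self._positions(key):
            if not (self.bits[p // 8] >> (p % 8)) & 1:
                self.bits[p // 8] |= 1 << (p % 8)
                new = True
        return new                      # True if the flow was not seen before


ip_table, flow_bf = set(), BloomFilter()
flow_counts = defaultdict(int)
SAMPLE_RATE = 0.1                       # illustrative sampling probability


def observe(src, dst):
    flow = (src, dst)
    if src in ip_table:                 # already tracked: count each distinct flow once
        if flow_bf.add(flow):
            flow_counts[src] += 1
    elif random.random() < SAMPLE_RATE: # only sources with many flows are likely sampled
        ip_table.add(src)
        flow_bf.add(flow)
        flow_counts[src] = 1


def super_points(threshold=100):
    # Compensating the threshold for the sampling rate is a detail of the paper; omitted here.
    return [ip for ip, c in flow_counts.items() if c >= threshold]
```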
Edge-computing-enabled smart greenhouses are a representative application of Internet of Things (IoT) technology: they can monitor environmental information in real time and employ that information to support intelligent decision-making. In this process, anomaly detection for wireless sensor data plays an important role. However, traditional anomaly detection algorithms, originally designed for static data, do not properly consider the inherent characteristics of the data streams produced by wireless sensors, such as infiniteness, correlations, and concept drift, which poses a considerable challenge to stream-based anomaly detection and leads to low detection accuracy and efficiency. First, the data stream is usually generated quickly, which means that it is infinite and enormous; hence, any traditional off-line anomaly detection algorithm that attempts to store the whole dataset or to scan it multiple times will run out of memory. Second, there exist correlations among different data streams, and traditional algorithms hardly consider these correlations. Third, the underlying data generation process or distribution may change over time, so traditional anomaly detection algorithms with no model update lose their effectiveness. Considering these issues, a novel method called DLSHiForest, based on Locality-Sensitive Hashing and the time window technique, is proposed to solve these problems while achieving accurate and efficient detection. Comprehensive experiments are executed using a real-world agricultural greenhouse dataset to demonstrate the feasibility of the approach. Experimental results show that the proposal is practical for addressing the challenges of traditional anomaly detection while ensuring accuracy and efficiency.
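A hedged sketch of the two named ingredients, Locality-Sensitive Hashing and a time window, is given below; it is not a reimplementation of DLSHiForest. Points are bucketed by random projections, and within the current window a point falling into a near-empty bucket is treated as an anomaly candidate; all parameters are illustrative.

```python
# LSH-plus-time-window sketch, not a DLSHiForest reimplementation.
import numpy as np
from collections import deque, Counter

rng = np.random.default_rng(0)
DIM, N_PROJ, BIN_W, WINDOW = 4, 8, 0.5, 500      # illustrative parameters
PLANES = rng.normal(size=(N_PROJ, DIM))
OFFSETS = rng.uniform(0, BIN_W, size=N_PROJ)


def lsh_key(x):
    """Quantised random projections give the LSH bucket key."""
    return tuple(np.floor((PLANES @ x + OFFSETS) / BIN_W).astype(int))


window = deque()                 # bucket keys of the most recent points
bucket_sizes = Counter()


def process(x, min_bucket=3):
    key = lsh_key(np.asarray(x, float))
    window.append(key)
    bucket_sizes[key] += 1
    if len(window) > WINDOW:                     # expire the oldest point
        old = window.popleft()
        bucket_sizes[old] -= 1
    return bucket_sizes[key] < min_bucket        # True -> anomaly candidate
```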
Data aggregation from various web sources is very significant for the web data analysis domain. In addition, the recognition of coherent micro-clusters is one of the most interesting issues in the field of data aggregation. Until now, many algorithms have been proposed to work on this issue. However, the deficiency of these solutions is that they cannot recognize micro-clusters in the data stream accurately. A semantic-based coherent micro-cluster recognition algorithm for hybrid web data streams is proposed. First, an objective function is proposed to recognize coherent micro-clusters, and then the coherent micro-cluster recognition algorithm for hybrid web data streams based on semantics is presented.
This paper presents two one-pass algorithms for dynamically computing frequency counts over a sliding window on a data stream, that is, computing the frequency counts that exceed a user-specified threshold ε. The first algorithm constructs sub-windows and periodically deletes expired sub-windows from the sliding window, with each sub-window maintaining a summary data structure; it outputs at most 1/ε + 1 elements for frequency queries over the most recent N elements. The second algorithm adopts a multi-level method to deal with the data stream. Once the sketch of the most recent N elements has been constructed, the second algorithm can provide answers to frequency queries over the most recent n (n ≤ N) elements; it outputs at most 1/ε + 2 elements. The analytical and experimental results show that both algorithms are accurate and effective.
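A simplified sketch of the sub-window bookkeeping of the first algorithm follows: each sealed sub-window keeps a Misra-Gries summary with about 1/ε counters, expired sub-windows are dropped wholesale, and a query merges the surviving summaries. The paper's summary structure and error analysis are more careful than this stand-in; the sub-window size and ε below are illustrative.

```python
# Sub-window sketch with Misra-Gries summaries (a stand-in for the paper's summaries).
from collections import Counter

EPS = 0.01
SUB_SIZE = 1000                          # items per sub-window (illustrative)


def misra_gries(items, k):
    """Classic Misra-Gries summary keeping at most k counters."""
    counters = Counter()
    for x in items:
        if x in counters or len(counters) < k:
            counters[x] += 1
        else:
            for key in list(counters):   # decrement all counters, drop the zeros
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


class SlidingFrequency:
    def __init__(self, n_sub):
        self.n_sub = n_sub               # window = n_sub * SUB_SIZE items
        self.subs, self.buffer = [], []

    def add(self, x):
        self.buffer.append(x)
        if len(self.buffer) == SUB_SIZE:             # seal a sub-window
            self.subs.append(misra_gries(self.buffer, int(1 / EPS)))
            self.buffer = []
            if len(self.subs) > self.n_sub:          # expire the oldest sub-window
                self.subs.pop(0)

    def frequent(self, threshold):
        merged = Counter()
        for summary in self.subs:        # merge the surviving summaries
            merged.update(summary)
        return [x for x, c in merged.items() if c >= threshold]
```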
The rapid developments in the fields of telecommunication, sensor data, financial applications, analysis of data streams, and so on increase the rate of data arrival, for which data mining is considered a vital process. The data analysis process consists of different tasks, among which data stream classification approaches face more challenges than the other commonly used techniques. Even though classification is a continuous process, it requires a design that can adapt the classification model to concept change or to changes of the boundaries between classes. Hence, we design a novel fuzzy classifier known as THRFuzzy to classify new incoming data streams. Rough set theory along with a tangential holoentropy function helps in designing the dynamic classification model. The classification approach uses kernel fuzzy c-means (FCM) clustering for the generation of the rules and the tangential holoentropy function to update the membership function. The performance of the proposed THRFuzzy method is verified using three datasets, namely the skin segmentation, localization, and breast cancer datasets, and its performance is compared with the HRFuzzy and adaptive k-NN classifiers using accuracy and time as the evaluation metrics. The experimental results conclude that the THRFuzzy classifier shows better classification results, providing maximum accuracy while consuming minimal time compared with the existing classifiers.
This paper focuses on the parallel aggregation processing of data streams based on the shared-nothing architecture. A novel granularity-aware parallel aggregating model is proposed. It employs parallel sampling and linear regression to describe the characteristics of the data quantity in the query window in order to determine the partition granularity of tuples, and utilizes an equal-depth histogram to implement partitioning. This method can avoid data skew and reduce communication cost. The experimental results on both synthetic and actual data prove that the proposed method is efficient, practical, and suitable for processing time-varying data streams.
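The equal-depth histogram partitioning step can be sketched as below: the window is sampled, equi-frequency bucket boundaries are taken from the sample quantiles, and each tuple is routed to the node that owns its bucket, so nodes receive comparable shares even under skew. The sample size, number of nodes, and attribute distribution are illustrative assumptions.

```python
# Equal-depth (equi-frequency) histogram partitioning sketch.
import numpy as np


def equal_depth_boundaries(sample, n_nodes):
    qs = np.linspace(0, 1, n_nodes + 1)[1:-1]          # inner quantiles
    return np.quantile(np.asarray(sample, float), qs)


def route(value, boundaries):
    return int(np.searchsorted(boundaries, value))     # node id in [0, n_nodes)


sample = np.random.default_rng(1).exponential(size=10_000)   # skewed attribute values
bounds = equal_depth_boundaries(sample, n_nodes=4)
print([route(v, bounds) for v in (0.1, 0.7, 1.5, 5.0)])
```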
How to process aggregate queries over data streams efficiently and effectively has become a hot research topic in both the academic and industrial communities. Aiming at this issue, a novel Linked-tree algorithm based on a sliding window is proposed in this paper. Thanks to the proposed concept of areas, the Linked-tree algorithm reuses many results from the last window and thus avoids a large number of unnecessary repeated comparison operations between two successive windows. As a result, the execution efficiency of MAX queries is improved dramatically. In addition, since the size of memory depends on the number of areas rather than on the size of the sliding window, memory is economized greatly. Extensive experimental results show that the performance of the Linked-tree algorithm has significant improvement gains over the traditional SC (Simple Compared) algorithm and the Ranked-tree algorithm.
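The snippet below is not the Linked-tree itself, but a compact illustration of the same reuse principle for MAX over a sliding window: a monotonically decreasing deque keeps only those elements that can still become the maximum, so consecutive windows share almost all of their comparison work and memory depends on far fewer entries than the window size.

```python
# Monotonic-deque illustration of result reuse for sliding-window MAX.
from collections import deque


def sliding_max(stream, window):
    dq = deque()                      # holds (index, value) with values decreasing
    for i, v in enumerate(stream):
        while dq and dq[-1][1] <= v:  # v dominates every smaller element behind it
            dq.pop()
        dq.append((i, v))
        if dq[0][0] <= i - window:    # the front has slid out of the window
            dq.popleft()
        if i >= window - 1:
            yield dq[0][1]


print(list(sliding_max([3, 1, 4, 1, 5, 9, 2, 6], window=3)))  # [4, 4, 5, 9, 9, 9]
```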
Processing a join over unbounded input streams requires unbounded memory, since every tuple in one infinite stream must be compared with every tuple in the other. In fact, most join queries over unbounded input streams are restricted to finite memory due to sliding window constraints. So far, non-indexed and indexed stream equijoin algorithms based on sliding windows have been proposed in the literature. However, none of them takes non-equijoins into consideration. In many cases, non-equijoin queries occur frequently; hence, it is worth discussing how to process non-equijoin queries effectively and efficiently. In this paper, we propose an indexed join algorithm for supporting non-equijoin queries. The experimental results show that our indexed non-equijoin techniques are more efficient than those without an index.
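A hedged sketch of index-supported non-equijoin processing is shown below: keeping one window sorted on the join attribute (here with Python's bisect module) turns a predicate such as r.b > s.a into a range lookup instead of a full window scan. This illustrates the general idea rather than the paper's index structure, and tuple expiration is omitted for brevity.

```python
# Sorted-window index sketch for a non-equi predicate (s.a < r.b).
import bisect

s_keys, s_tuples = [], []            # window of stream S, kept sorted on attribute a


def insert_s(tup):                   # tup = (a, payload)
    pos = bisect.bisect_left(s_keys, tup[0])
    s_keys.insert(pos, tup[0])
    s_tuples.insert(pos, tup)


def probe_r(r_b):
    """Return all S-tuples with a < r_b (the non-equi join predicate)."""
    return s_tuples[:bisect.bisect_left(s_keys, r_b)]


for t in [(5, "s1"), (1, "s2"), (9, "s3"), (3, "s4")]:
    insert_s(t)
print(probe_r(6))                    # [(1, 's2'), (3, 's4'), (5, 's1')]
```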
Logistic regression is a fast classifier and can achieve high accuracy on small training data. Moreover, it can work on both discrete and continuous attributes with nonlinear patterns. Based on these properties of logistic regression, this paper proposes an algorithm, called the evolutionary logistic regression classifier (ELRClass), to solve the classification of evolving data streams. The algorithm applies logistic regression repeatedly to a sliding window of samples in order to update the existing classifier, to keep this classifier if its performance has been deteriorated by bursting noise, or to construct a new classifier if a major concept drift is detected. Intensive experimental results demonstrate the effectiveness of this algorithm.
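The update policy described in the abstract can be sketched as follows; this is not the ELRClass code. A logistic-regression model (scikit-learn's SGDClassifier with logistic loss, which supports incremental updates and assumes a recent scikit-learn release) is refreshed on each window, the previous classifier is kept when the update looks degraded by bursting noise, and a fresh classifier is built when accuracy collapses, taken here as a crude proxy for a major concept drift; both thresholds are illustrative.

```python
# Sliding-window update policy sketch (not the ELRClass implementation).
import copy
import numpy as np
from sklearn.linear_model import SGDClassifier   # log-loss SGD = online logistic regression

CLASSES = np.array([0, 1])


def new_classifier(X, y):
    return SGDClassifier(loss="log_loss", random_state=0).partial_fit(X, y, classes=CLASSES)


def step(model, X_win, y_win, drift_threshold=0.5):
    """One sliding-window step of the update policy sketched in the abstract."""
    if model is None:
        return new_classifier(X_win, y_win)
    if model.score(X_win, y_win) < drift_threshold:   # crude proxy for major concept drift
        return new_classifier(X_win, y_win)
    updated = copy.deepcopy(model).partial_fit(X_win, y_win)
    # Keep the previous classifier if the update was degraded by bursting noise.
    return updated if updated.score(X_win, y_win) >= model.score(X_win, y_win) else model


rng = np.random.default_rng(0)
X_win = rng.normal(size=(200, 5))
y_win = (X_win[:, 0] > 0).astype(int)        # synthetic, linearly separable labels
model = step(None, X_win, y_win)
model = step(model, X_win, y_win)            # later windows reuse, refresh, or replace it
```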
At present, data stream classification plays a key role in big data analytics due to its enormous growth. Most existing classification methods use ensemble learning, which is trustworthy, but these methods are not effective in facing the issues of learning from imbalanced big data and also assume that all data are pre-classified. Another weakness of current methods is that they take a long evaluation time when the target data stream contains a high number of features. The main objective of this research is to develop a new method for incremental learning based on the proposed ant lion fuzzy-generative adversarial network model. The proposed model is implemented in a Spark architecture. For each data stream, the class output is computed at slave nodes by training a generative adversarial network with back-propagation error based on fuzzy bound computation. This method overcomes the limitations of existing methods, as it can classify data streams that are slightly or completely unlabeled while providing high scalability and efficiency. The results show that the proposed model achieves state-of-the-art performance in terms of accuracy (0.861), precision (0.9328), and minimal MSE (0.0416).
In this paper, a new algorithm, HCOUNT+, is proposed to find frequent items over a data stream, based on the HCOUNT algorithm. The new algorithm adopts aided measures to greatly improve the precision of HCOUNT. In addition, HCOUNT+ is extended to time-critical applications, and a novel sliding-window-based algorithm, SL-HCOUNT+, is proposed to mine the most frequent items occurring recently. The algorithm uses limited memory (nB·(1+α)·(e/ε)·ln(−M/ln ρ) counters, with α < 1), requires constant processing time per packet (only (1+α)·ln(−M/ln ρ) counters, α < 1, are updated), makes only one pass over the streaming data, and is shown to work well in the experimental results.
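HCOUNT and HCOUNT+ belong to the family of hash-based counter sketches; the snippet below shows a minimal member of that family, a Count-Min-style table plus a candidate set, rather than the paper's aided measures or the sliding-window bookkeeping of SL-HCOUNT+. The depth, width, and the unbounded candidate dictionary are simplifications.

```python
# Count-Min-style frequent-item sketch, standing in for the HCOUNT family.
import hashlib

DEPTH, WIDTH = 4, 2048                      # illustrative sketch dimensions
table = [[0] * WIDTH for _ in range(DEPTH)]
candidates = {}                             # item -> latest count estimate


def _cell(row, item):
    h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(h, "big") % WIDTH


def add(item):
    for r in range(DEPTH):
        table[r][_cell(r, item)] += 1
    # Count-Min estimate: minimum over the rows, never an under-estimate.
    candidates[item] = min(table[r][_cell(r, item)] for r in range(DEPTH))


def frequent(threshold):
    return [x for x, c in candidates.items() if c >= threshold]


for token in ["a", "b", "a", "c", "a", "b"]:
    add(token)
print(frequent(2))                          # ['a', 'b']
```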
Previous weighted frequent pattern (WFP) mining algorithms are not suitable for data streams because they need multiple database scans. In this paper, we present an efficient algorithm, SWFP-Miner, to mine weighted frequent patterns over data streams. SWFP-Miner is based on a sliding window and can discover important frequent patterns from the recent data. A new refined weight definition is proposed to keep the downward closure property, and two pruning strategies are presented to prune the weighted infrequent patterns. Experimental studies are performed to evaluate the effectiveness and efficiency of SWFP-Miner.
Microservices have become popular in enterprises because of their excellent scalability and timely update capabilities. However, while fine-grained modularity and service-orientation decrease the complexity of system development, the complexity of system operation and maintenance has, on the contrary, greatly increased. Multiple types of system failures occur frequently, and it is hard to detect and diagnose failures in time. Furthermore, microservices are updated frequently. Existing anomaly detection models depend on offline training and cannot adapt to the frequent updates of microservices. This paper proposes an anomaly detection approach for microservice systems with multi-source data streams. The approach realizes online model construction and online anomaly detection, and is capable of self-updating and self-adapting. Experimental results show that this approach can correctly identify 78.85% of faults of different types.
The analytical capacity for massive data has become increasingly necessary, given the high volume of data generated daily by different sources. The data sources are varied and can generate a huge amount of data, which can be processed in batch or stream settings. The stream setting corresponds to the treatment of a continuous sequence of data that arrives as a real-time flow and needs to be processed in real time. The models, tools, methods, and algorithms for generating intelligence from data streams culminate in the approaches of Data Stream Mining and Data Stream Learning. The activities of such approaches can be organized and structured according to Engineering principles, thus allowing the principles of Analytical Engineering, or more specifically, Analytical Engineering for Data Stream (AEDS). Thus, this article presents the AEDS conceptual framework composed of four pillars (Data, Model, Tool, People) and three processes (Acquisition, Retention, Review). The definition of these pillars and processes is based on the main components of the data stream setting, corresponding to the four pillars, and on the need to operationalize the activities of an Analytical Organization (AO) in the use of the four AEDS pillars, which determines the three proposed processes. The AEDS framework supports the projects carried out in an AO, that is, its Analytical Projects (AP), favoring the delivery of results, or Analytical Deliverables (AD), produced by the Analytical Teams (AT) in order to provide intelligence from stream data.
The distance-based outlier detection method detects implied outliers by calculating the distances between points in the dataset, but its computational complexity is particularly high when processing multidimensional datasets. In addition, traditional outlier detection methods do not consider the frequency of subset occurrence, so the detected outliers do not fit the definition of outliers (i.e., rarely appearing). Pattern mining-based outlier detection approaches have solved this problem, but the importance of each pattern is not taken into account in the outlier detection process, so the detected outliers cannot truly reflect the actual situation. Aimed at these problems, a two-phase minimal weighted rare pattern mining-based outlier detection approach, called MWRPM-Outlier, is proposed to effectively detect outliers in weighted data streams. In particular, a method called MWRPM is proposed in the pattern mining phase to quickly mine the minimal weighted rare patterns, and then two deviation factors are defined in the outlier detection phase to measure the abnormal degree of each transaction in the weighted data stream. Experimental results show that the proposed MWRPM-Outlier approach has excellent performance in outlier detection and that MWRPM outperforms existing methods in weighted rare pattern mining.