Recently, anomaly detection (AD) in streaming data has gained significant attention among research communities due to its applicability in finance, business, healthcare, education, etc. Recent developments in deep learning (DL) models have proven helpful in the detection and classification of anomalies. This article designs an oversampling with optimal deep learning-based streaming data classification (OS-ODLSDC) model. The aim of the OS-ODLSDC model is to recognize and classify anomalies in streaming data. The proposed OS-ODLSDC model first applies a preprocessing step. Since streaming data is unbalanced, the support vector machine (SVM)-based Synthetic Minority Over-sampling Technique (SVM-SMOTE) is applied for oversampling. In addition, the OS-ODLSDC model employs bidirectional long short-term memory (BiLSTM) for AD and classification. Finally, the root mean square propagation (RMSProp) optimizer is applied for optimal hyperparameter tuning of the BiLSTM model. To assess the performance of the OS-ODLSDC model, a wide-ranging experimental analysis is performed using three benchmark datasets: CICIDS 2018, KDD-Cup 1999, and NSL-KDD.
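The oversampling step can be illustrated with a minimal sketch of plain SMOTE interpolation. This is not the SVM-SMOTE variant used in the abstract (which additionally uses an SVM to pick borderline minority points); the function names `sq_dist` and `smote` and the parameter `k` are assumptions made for illustration only.

```python
import random
from typing import List, Optional, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority: List[List[float]], n_new: int, k: int = 3,
          rng: Optional[random.Random] = None) -> List[List[float]]:
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + u * (o - b) for b, o in zip(base, other)])
    return synthetic
```

Each synthetic point lies on the segment between two real minority points, so the minority class grows without simply duplicating samples.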
Data stream clustering is integral to contemporary big data applications. However, handling the ongoing influx of data streams efficiently and accurately remains a primary challenge in current research. This paper aims to improve the efficiency and precision of data stream clustering. Using the TEDA (Typicality and Eccentricity Data Analysis) algorithm as a foundation, we integrate a nearest neighbor search algorithm to enhance both the efficiency and the accuracy of the algorithm. The original TEDA algorithm, grounded in the concept of "Typicality and Eccentricity Data Analytics", is an evolving, recursive method that requires no prior knowledge. While the algorithm autonomously creates and merges clusters as new data arrives, its efficiency is significantly hindered by the need to traverse all existing clusters whenever further data arrives. This work presents the NS-TEDA (Neighbor Search Based Typicality and Eccentricity Data Analysis) algorithm, which incorporates a KD-Tree (K-Dimensional Tree) algorithm integrated with the Scapegoat Tree. This ensures that, upon arrival, new data points interact solely with clusters in very close proximity. This significantly enhances algorithm efficiency while preventing a single data point from joining too many clusters and, to some extent, mitigating the merging of clusters with high overlap. We apply the NS-TEDA algorithm to several well-known datasets, comparing its performance with other data stream clustering algorithms and the original TEDA algorithm. The results demonstrate that the proposed algorithm achieves higher accuracy and that its runtime exhibits almost linear dependence on the volume of data, making it more suitable for large-scale data stream analysis research.
Rapid developments in telecommunications, sensor data, financial applications, data stream analysis, and similar fields have increased the rate of data arrival, making data mining a vital process. Data analysis consists of different tasks, among which data stream classification faces more challenges than other commonly used techniques. Because classification is a continuous process, it requires a design that can adapt the classification model to concept change or to changes in the boundary between classes. Hence, we design a novel fuzzy classifier known as THRFuzzy to classify new incoming data streams. Rough set theory, along with the tangential holoentropy function, helps in designing the dynamic classification model. The classification approach uses kernel fuzzy c-means (FCM) clustering to generate the rules and the tangential holoentropy function to update the membership function. The performance of the proposed THRFuzzy method is verified on three datasets, namely the skin segmentation, localization, and breast cancer datasets, using accuracy and time as evaluation metrics, and is compared with that of the HRFuzzy and adaptive k-NN classifiers. The experimental results show that the THRFuzzy classifier gives better classification results than the existing classifiers, achieving maximum accuracy in minimal time.
A new algorithm for clustering multiple data streams is proposed. The algorithm can effectively cluster data streams that show similar behavior with some unknown time delays. The algorithm uses the autoregressive (AR) modeling technique to measure correlations between data streams. It exploits estimated frequency spectra to extract the essential features of streams. Each stream is represented as the sum of spectral components, and the correlation is measured component-wise. Each spectral component is described by four parameters, namely amplitude, phase, damping rate, and frequency. The ε-lag-correlation between two spectral components is calculated. The algorithm uses this information as a similarity measure in clustering data streams. Based on a sliding window model, the algorithm can continuously report the most recent clustering results and adjust the number of clusters. Experiments on real and synthetic streams show that the proposed clustering method has a higher speed and clustering quality than other similar methods.
At present, multi-sensor fusion is widely used in object recognition and classification, since this technique can efficiently improve accuracy and fault tolerance. This paper describes a model-based multi-sensor fusion system used for rotating mechanical failure diagnosis. In the data fusion process, a fuzzy neural network is selected and used for data fusion at the report level. A comparison of the experimental results of fault diagnoses based on fused data with those based on the original separate data shows that the former is more accurate than the latter.
A novel data stream partitioning method is proposed to resolve problems of range-aggregation continuous queries over parallel streams for the power industry. The first step of this method is to sample the data in parallel, which is implemented as an extended reservoir-sampling algorithm. A skip factor based on the change ratio of data values is introduced to describe the distribution characteristics of data values adaptively. The second step of this method is to partition the fluxes of data streams evenly, which is implemented with two alternative equal-depth histogram generating algorithms that fit different cases: one for incremental maintenance based on heuristics and the other for periodical updates to generate an approximate partition vector. The experimental results on actual data show that the method is efficient, practical, and suitable for processing time-varying data streams.
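The two steps above can be sketched as follows. This is a minimal illustration that uses classic reservoir sampling (Algorithm R) and a simple sort-based equal-depth split; the paper's extended sampler with its skip factor and its two histogram maintenance algorithms are not reproduced. The names `reservoir_sample` and `equal_depth_boundaries` are hypothetical.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(stream: Iterable[float], k: int,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir: List[float] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:                   # happens with probability k / (i + 1)
                reservoir[j] = item
    return reservoir


def equal_depth_boundaries(sample: List[float], parts: int) -> List[float]:
    """Split points that divide the sample into `parts` buckets of
    (nearly) equal size, usable as an approximate partition vector."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        return []
    return [ordered[(i * n) // parts] for i in range(1, parts)]
```

For example, `equal_depth_boundaries(reservoir_sample(values, 1000), 4)` yields three split points that send roughly a quarter of the sampled values to each of four partitions.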
To avoid redundant and inconsistent information in distributed data streams, a sampling method based on min-wise hash functions is designed, and the practical semantics of the union of distributed data streams is defined. First, for each family of min-wise hash functions, the data with the minimum hash value are selected as local samples, which filters out the bias caused by frequent updates in a single data stream. Second, for the same hash function, the sample with the minimum hash value is selected as the global sample, and the local samples are combined at the center node to filter out the bias caused by duplicated updates. Finally, based on the obtained uniform samples, several aggregations on the defined semantics of the union of data streams are precisely estimated. The results of comparison tests on synthetic and real-life data streams demonstrate the effectiveness of this method.
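The local and global selection steps can be sketched as below, assuming string-valued items and using seeded SHA-256 as a stand-in for a min-wise hash family. The helper names `h`, `local_samples`, and `global_samples` are illustrative and do not come from the paper.

```python
import hashlib
from typing import Iterable, List, Optional


def h(item: str, seed: int) -> int:
    """Seeded hash; each seed plays the role of one hash function."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def local_samples(stream: Iterable[str], seeds: List[int]) -> List[Optional[str]]:
    """For each hash function, keep the item with the minimum hash value.
    Repeated occurrences of an item hash identically, so they add no bias."""
    best: List[Optional[str]] = [None] * len(seeds)
    best_hash: List[Optional[int]] = [None] * len(seeds)
    for item in stream:
        for idx, seed in enumerate(seeds):
            value = h(item, seed)
            if best_hash[idx] is None or value < best_hash[idx]:
                best_hash[idx] = value
                best[idx] = item
    return best


def global_samples(per_node: List[List[Optional[str]]],
                   seeds: List[int]) -> List[Optional[str]]:
    """At the center node, keep the minimum-hash local sample per hash function."""
    result: List[Optional[str]] = []
    for idx, seed in enumerate(seeds):
        candidates = [s[idx] for s in per_node if s[idx] is not None]
        result.append(min(candidates, key=lambda it: h(it, seed)) if candidates else None)
    return result
```

Because the minimum over the union equals the minimum of the per-node minima, the global sample is the same as if one node had seen the set of distinct items from all streams.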
An intrusion detection (ID) model is proposed based on the fuzzy data mining method. A major difficulty of anomaly ID is that patterns of normal behavior change with time. In addition, an actual intrusion with a small deviation may match normal patterns, so the intrusion behavior cannot be detected by the detection system. To solve this problem, a fuzzy data mining technique is used to extract patterns representing the normal behavior of a network. A set of fuzzy association rules mined from the network data serves as a model of "normal behaviors". To detect anomalous behaviors, fuzzy association rules are generated from new audit data, and their similarity with the sets mined from "normal" data is computed. If the similarity values are lower than a threshold value, an alarm is given. Furthermore, genetic algorithms are used to adjust the fuzzy membership functions and to select an appropriate set of features.
To improve the precision of super point detection and control the consumption of measurement resources, this paper proposes a super point detection method based on sampling and data streaming algorithms (SDSD), and proves that, using the SDSD algorithm, only sources or destinations with many flows can be sampled probabilistically. The SDSD algorithm uses both the IP table and the flow Bloom filter (BF) data structures to maintain the IP and flow information. The IP table is used to judge whether an IP address has been recorded. If the IP exists, then all its subsequent flows are recorded into the flow BF; otherwise, the IP flow is sampled. This paper also analyzes the accuracy and memory requirements of the SDSD algorithm and tests them using the CERNET trace. The theoretical analysis and experimental tests demonstrate that most relative errors of the super points estimated by the SDSD algorithm are less than 5%, whereas those of other algorithms are about 10%. Because of the BF structure, the SDSD algorithm is also better than previous algorithms in terms of memory consumption.
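The interplay of the IP table and the flow Bloom filter can be sketched as below. This is a rough illustration under stated assumptions: the class names, the sampling probability parameter, the hash construction, and the per-IP counter are all hypothetical, and the paper's estimator for correcting the counts after sampling is not reproduced.

```python
import hashlib
import random
from typing import Dict, Iterator, Optional


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> bool:
        """Insert key; return True if it was (probably) already present."""
        present = True
        for pos in self._positions(key):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                present = False
                self.bits[byte] |= 1 << bit
        return present


class SuperPointSketch:
    def __init__(self, sample_prob: float, num_bits: int = 1 << 16,
                 num_hashes: int = 4, rng: Optional[random.Random] = None) -> None:
        self.sample_prob = sample_prob
        self.ip_table: Dict[str, int] = {}   # recorded IP -> distinct flows seen
        self.flow_bf = BloomFilter(num_bits, num_hashes)
        self.rng = rng or random.Random()

    def observe(self, src: str, dst: str) -> None:
        flow = f"{src}->{dst}"
        if src in self.ip_table:
            # Recorded IP: count every flow the Bloom filter has not seen yet.
            if not self.flow_bf.add(flow):
                self.ip_table[src] += 1
        elif self.rng.random() < self.sample_prob:
            # Unrecorded IP: its flow is sampled with probability sample_prob.
            self.flow_bf.add(flow)
            self.ip_table[src] = 1
```

A source with many flows gets many chances to be sampled, so it enters the IP table with high probability, while a source with few flows is rarely recorded.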
Conventional data envelopment analysis (DEA) measures the relative efficiencies of a set of decision making units with exact values of inputs and outputs. In real-world problems, however, inputs and outputs typically have some level of fuzziness. To analyze a decision making unit (DMU) with fuzzy input/output data, previous studies provided the fuzzy DEA model and proposed an associated evaluation approach. Nonetheless, numerous deficiencies remain to be addressed, including the α-cut approaches, types of fuzzy numbers, and ranking techniques. Moreover, a fuzzy sample DMU still cannot be evaluated with the fuzzy DEA model. Therefore, this paper proposes a fuzzy DEA model based on a sample decision making unit (FSDEA). Five evaluation approaches and the related algorithm and ranking methods are provided to test the fuzzy sample DMU of the FSDEA model. A numerical experiment is used to demonstrate the results and compare them with those obtained using alternative approaches.
Edge-computing-enabled smart greenhouses are a representative application of Internet of Things (IoT) technology, which can monitor environmental information in real time and use that information to support intelligent decision-making. In this process, anomaly detection for wireless sensor data plays an important role. However, traditional anomaly detection algorithms, originally designed for static data, do not properly consider the inherent characteristics of the data stream produced by wireless sensors, such as infiniteness, correlations, and concept drift. This may pose a considerable challenge to anomaly detection based on data streams and lead to low detection accuracy and efficiency. First, the data stream is usually generated quickly, which means that it is infinite and enormous. Hence, any traditional off-line anomaly detection algorithm that attempts to store the whole dataset or to scan the dataset multiple times will run out of memory. Second, there exist correlations among different data streams, and traditional algorithms hardly consider these correlations. Third, the underlying data generation process or distribution may change over time. Thus, traditional anomaly detection algorithms with no model update will lose their effectiveness. Considering these issues, a novel method (called DLSHiForest) based on Locality-Sensitive Hashing and the time window technique is proposed to solve these problems while achieving accurate and efficient detection. Comprehensive experiments are executed using a real-world agricultural greenhouse dataset to demonstrate the feasibility of our approach. Experimental results show that our proposal is practical for addressing the challenges of traditional anomaly detection while ensuring accuracy and efficiency.
In this paper, we consider the problem of evaluating system reliability using statistical data obtained from reliability tests of its elements, in which the lifetimes of elements are described using an exponential distribution. We assume that this lifetime data may be reported imprecisely and that this lack of precision may be described using fuzzy sets. As the direct application of the fuzzy sets methodology leads in this case to very complicated and time-consuming calculations, we propose simple approximations of fuzzy numbers using shadowed sets introduced by Pedrycz (1998). The proposed methodology may be simply extended to the case of general lifetime probability distributions.
The distance-based outlier detection method detects implied outliers by calculating the distances between points in the dataset, but its computational complexity is particularly high when processing multidimensional datasets. In addition, the traditional outlier detection method does not consider how frequently subsets occur, so the detected outliers do not fit the definition of outliers (i.e., rarely appearing). Pattern mining-based outlier detection approaches have solved this problem, but they do not take the importance of each pattern into account in the outlier detection process, so the detected outliers cannot truly reflect the actual situation. To address these problems, a two-phase minimal weighted rare pattern mining-based outlier detection approach, called MWRPM-Outlier, is proposed to effectively detect outliers on the weight data stream. In particular, a method called MWRPM is proposed in the pattern mining phase to quickly mine the minimal weighted rare patterns, and then two deviation factors are defined in the outlier detection phase to measure the abnormal degree of each transaction on the weight data stream. Experimental results show that the proposed MWRPM-Outlier approach has excellent performance in outlier detection and that the MWRPM approach performs well in weighted rare pattern mining.
Clustering is an unsupervised learning problem: a procedure that partitions data objects into groups. Many algorithms cannot overcome the problems of morphology, overlapping, and a large number of clusters at the same time. Many scientific communities have used clustering algorithms from the perspective of density, which is one of the best methods in clustering. This study proposes a density-based spatial clustering of applications with noise (DBSCAN) algorithm based on high-density areas selected by automatic fuzzy-DBSCAN (AFD), which works with the initialization of two parameters. AFD, using fuzzy and DBSCAN features, is modeled by the selection of high-density areas and automatically generates two parameters for merging and separating. The two generated parameters provide a state of sub-cluster rules in the Cartesian coordinate system for the dataset. The model overcomes clustering problems such as morphology, overlapping, and the number of clusters in a dataset simultaneously. In the experiments, all algorithms are run 30 times on eight datasets. Three of them are overlapping real datasets, and the rest are morphologic and synthetic datasets. It is demonstrated that the AFD algorithm outperforms other recently developed clustering algorithms.
Continuous response to range queries on streaming data provides useful information for many practical applications, but also carries the risk of privacy disclosure. Existing research on differentially private streaming data publication mostly focuses on boosting query accuracy, pays less attention to query efficiency, and ignores the effect of timeliness on data weight. In this paper, we propose an effective algorithm for differentially private streaming data publication under the exponential decay mode. Firstly, by introducing the Fenwick tree to divide and reorganize data items in the stream, we achieve a constant time complexity for inserting a new item and getting the prefix sum. Meanwhile, we achieve a time complexity linear in the number of data items for building a tree. After that, we use the advantages of the matrix mechanism to deal with correlated queries and reduce the global sensitivity. In addition, we choose a proper diagonal matrix to further improve the range query accuracy. Finally, to account for exponential decay, every data item is weighted by the decay factor. By putting the Fenwick tree and matrix optimization together, we present a complete algorithm for differentially private real-time streaming data publication. The experiment is designed to compare the algorithm in this paper with similar algorithms for streaming data release under exponential decay. Experimental results show that the algorithm in this paper effectively improves query efficiency while ensuring the quality of the query.
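For readers unfamiliar with the data structure, a minimal sketch of a textbook Fenwick tree (binary indexed tree) follows. It shows only the linear-time build and the prefix-sum and range-sum queries; the class and method names are illustrative. In this textbook form, updates and prefix sums take logarithmic time, so the constant-time bound stated in the abstract belongs to the paper's own construction, which is not reproduced here, nor are the privacy noise and decay weighting.

```python
from typing import List, Sequence


class FenwickTree:
    def __init__(self, values: Sequence[float]) -> None:
        """Build in O(n): push each node's partial sum to its parent once."""
        self.n = len(values)
        self.tree: List[float] = [0] * (self.n + 1)   # 1-based internally
        for i, v in enumerate(values, start=1):
            self.tree[i] += v
            parent = i + (i & -i)
            if parent <= self.n:
                self.tree[parent] += self.tree[i]

    def add(self, index: int, delta: float) -> None:
        """Add delta to the item at 0-based position `index`."""
        i = index + 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, count: int) -> float:
        """Sum of the first `count` items."""
        total = 0
        i = count
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def range_sum(self, lo: int, hi: int) -> float:
        """Sum of items in the half-open 0-based range [lo, hi)."""
        return self.prefix_sum(hi) - self.prefix_sum(lo)
```

A range query over stream positions `[lo, hi)` is answered as the difference of two prefix sums, which is why prefix-sum speed dominates range-query efficiency.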
This paper presents two one-pass algorithms for dynamically computing frequency counts in a sliding window over a data stream, that is, for finding the elements whose frequency counts exceed a user-specified threshold ε. The first algorithm constructs sub-windows and periodically deletes expired sub-windows in the sliding window, and each sub-window maintains a summary data structure. The first algorithm outputs at most 1/ε + 1 elements for frequency queries over the most recent N elements. The second algorithm adapts a multiple-level method to deal with the data stream. Once the sketch of the most recent N elements has been constructed, the second algorithm can provide answers to frequency queries over the most recent n (n ≤ N) elements. The second algorithm outputs at most 1/ε + 2 elements. The analytical and experimental results show that our algorithms are accurate and effective.
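To make the idea of a bounded-size frequency summary concrete, here is a minimal sketch of the classic Misra-Gries one-pass summary. It is a swapped-in textbook technique over a whole stream, not either of the paper's sliding-window algorithms, and the function name `misra_gries` is an assumption for illustration.

```python
import math
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], epsilon: float) -> Dict[Hashable, int]:
    """One-pass summary with at most ceil(1/epsilon) - 1 counters.
    Every item occurring more than epsilon * N times in a stream of
    length N is guaranteed to be among the returned keys; the stored
    counts are underestimates of the true frequencies."""
    k = math.ceil(1 / epsilon)
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

A sliding-window variant, as in the abstract, would keep one such summary per sub-window and discard the summaries of expired sub-windows.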
One of the goals of data collection is to prepare for decision-making, so high quality requirements must be satisfied. Rational evaluation of data quality is an effective way to identify data problems in time, so that the quality of the data after evaluation satisfies the requirements of the decision maker. A fuzzy neural network-based method for data quality evaluation is proposed. First, the criteria for evaluating data quality are selected to construct the fuzzy sets of evaluation grades; then, using the learning ability of the neural network (NN), an objective evaluation of membership is carried out, which can be used for the effective evaluation of data quality. This research has been used in the platform of 'data report of national compulsory education outlay guarantee' from the Chinese Ministry of Education. The method can be used for the effective evaluation of data quality worldwide, and with it the data quality situation can be determined more completely, objectively, and promptly.
Data aggregation from various web sources is very significant for the web data analysis domain. In addition, the recognition of coherence micro-clusters is one of the most interesting issues in the field of data aggregation. Until now, many algorithms have been proposed to work on this issue. However, the deficiency of these solutions is that they cannot recognize the micro-cluster data stream accurately. A semantic-based coherent micro-cluster recognition algorithm for hybrid web data streams is proposed. Firstly, an objective function is proposed to recognize the coherence micro-cluster, and then the semantic-based coherence micro-cluster recognition algorithm for hybrid web data streams is presented. Fi-
Funding (listed with the NS-TEDA data stream clustering abstract): This research was funded by the National Natural Science Foundation of China (Grant No. 72001190), by the Ministry of Education's Humanities and Social Science Project via the China Ministry of Education (Grant No. 20YJC630173), and by Zhejiang A&F University (Grant No. 2022LFR062).
Funding (listed with the THRFuzzy abstract): Supported by proposal No. OSD/BCUD/392/197, Board of Colleges and University Development, Savitribai Phule Pune University, Pune.
Funding (listed with the abstract on clustering multiple data streams): The National Natural Science Foundation of China (No. 60673060); the Natural Science Foundation of Jiangsu Province (No. BK2005047).
Funding (listed with the data stream partitioning abstract): The High Technology Research Plan of Jiangsu Province (No. BG2004034); the Foundation of Graduate Creative Program of Jiangsu Province (No. xm04-36).
Funding (listed with the min-wise hash sampling abstract): The National Natural Science Foundation of China (No. 60973023, 60603040); the Natural Science Foundation of Southeast University (No. KJ2009362).
文摘In order to avoid the redundant and inconsistent information in distributed data streams, a sampling method based on min-wise hash functions is designed and the practical semantics of the union of distributed data streams is defined. First, for each family of min-wise hash functions, the data with the minimum hash value are selected as local samples and the biased effect caused by frequent updates in a single data stream is filtered out. Secondly, for the same hash function, the sample with the minimum hash value is selected as the global sample and the local samples are combined at the center node to filter out the biased effect of duplicated updates. Finally, based on the obtained uniform samples, several aggregations on the defined semantics of the union of data streams are precisely estimated. The results of comparison tests on synthetic and real-life data streams demonstrate the effectiveness of this method.
Abstract: An intrusion detection (ID) model is proposed based on the fuzzy data mining method. A major difficulty of anomaly ID is that patterns of normal behavior change with time. In addition, an actual intrusion with a small deviation may match normal patterns, so the intrusion behavior cannot be detected by the detection system. To solve this problem, a fuzzy data mining technique is used to extract patterns representing the normal behavior of a network. A set of fuzzy association rules mined from the network data serves as a model of “normal behaviors”. To detect anomalous behaviors, fuzzy association rules are generated from new audit data and their similarity with the sets mined from “normal” data is computed. If the similarity values are lower than a threshold value, an alarm is given. Furthermore, genetic algorithms are used to adjust the fuzzy membership functions and to select an appropriate set of features.
Funding: The National Basic Research Program of China (973 Program) (No. 2009CB320505); the Natural Science Foundation of Jiangsu Province (No. BK2008288); the Excellent Young Teachers Program of Southeast University (No. 4009001018); the Open Research Program of Key Laboratory of Computer Network of Guangdong Province (No. CCNL200706).
Abstract: To improve the precision of super point detection and to control measurement resource consumption, this paper proposes a super point detection method based on sampling and data streaming algorithms (SDSD), and proves that only sources or destinations with a large number of flows can be sampled probabilistically using the SDSD algorithm. The SDSD algorithm uses both an IP table and a flow Bloom filter (BF) to maintain the IP and flow information. The IP table is used to judge whether an IP address has been recorded. If the IP exists, then all its subsequent flows are recorded into the flow BF; otherwise, the IP flow is sampled. This paper also analyzes the accuracy and memory requirements of the SDSD algorithm and tests them using the CERNET trace. The theoretical analysis and experimental tests demonstrate that most relative errors of the super points estimated by the SDSD algorithm are less than 5%, whereas those of other algorithms are about 10%. Because of the BF structure, the SDSD algorithm also outperforms previous algorithms in terms of memory consumption.
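The interplay of the IP table and the flow Bloom filter can be sketched roughly as below. This is only one reading of the abstract, under several assumptions: the sampling rule (a fixed probability `sample_prob` per unseen source), the Bloom filter sizing, and all names are hypothetical, and the paper's estimator that compensates for flows missed before sampling is not reproduced.

```python
import hashlib
import random


class BloomFilter:
    """Small Bloom filter; the size and hash count are illustrative defaults."""

    def __init__(self, bits=1 << 16, hashes=4):
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def count_flows(packets, sample_prob, rng=None):
    """Count distinct flows per source IP once that source has been sampled."""
    rng = rng or random.Random(0)
    ip_table = {}             # source IP -> number of distinct flows recorded
    flow_bf = BloomFilter()   # remembers which (src, dst) flows were counted
    for src, dst in packets:
        flow = f"{src}->{dst}"
        if src in ip_table:
            if flow not in flow_bf:       # new flow of a tracked source
                flow_bf.add(flow)
                ip_table[src] += 1
        elif rng.random() < sample_prob:  # unseen source: sample it
            ip_table[src] = 1
            flow_bf.add(flow)
    return ip_table


packets = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "x")]
flows_per_source = count_flows(packets, sample_prob=1.0)
```

With a small `sample_prob`, a source that opens many flows gets many chances to be sampled, whereas a source with few flows is likely never to enter the table, which is the intuition behind sampling only heavy sources.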
Funding: Supported by the National Natural Science Foundation of China (70961005); the 211 Project for Postgraduate Student Program of Inner Mongolia University; the National Natural Science Foundation of Inner Mongolia (2010Zd34, 2011MS1002).
Abstract: Conventional data envelopment analysis (DEA) measures the relative efficiencies of a set of decision making units with exact values of inputs and outputs. In real-world problems, however, inputs and outputs typically have some level of fuzziness. To analyze a decision making unit (DMU) with fuzzy input/output data, previous studies provided the fuzzy DEA model and proposed an associated evaluation approach. Nonetheless, numerous deficiencies remain to be addressed, including the α-cut approaches, the types of fuzzy numbers, and the ranking techniques. Moreover, a fuzzy sample DMU still cannot be evaluated with the fuzzy DEA model. Therefore, this paper proposes a fuzzy DEA model based on a sample decision making unit (FSDEA). Five evaluation approaches and the related algorithm and ranking methods are provided to test the fuzzy sample DMU of the FSDEA model. A numerical experiment is used to demonstrate the results and compare them with those obtained using alternative approaches.
Funding: Supported in part by the Fundamental Research Funds for the Central Universities under Grant No. 30919011282.
Abstract: Edge-computing-enabled smart greenhouses are a representative application of Internet of Things (IoT) technology; they monitor environmental information in real time and use it to support intelligent decision-making. In this process, anomaly detection for wireless sensor data plays an important role. However, traditional anomaly detection algorithms, originally designed for static data, do not properly consider the inherent characteristics of the data streams produced by wireless sensors, such as infiniteness, correlations, and concept drift. This poses a considerable challenge to stream-based anomaly detection and leads to low detection accuracy and efficiency. First, a data stream is usually generated quickly, which means that it is infinite and enormous. Hence, any traditional off-line anomaly detection algorithm that attempts to store the whole dataset or to scan it multiple times will run out of memory. Second, there are correlations among different data streams, and traditional algorithms hardly consider them. Third, the underlying data generation process or distribution may change over time, so traditional anomaly detection algorithms with no model update lose their effectiveness. Considering these issues, a novel method (called DLSHiForest) based on Locality-Sensitive Hashing and the time window technique is proposed to solve these problems while achieving accurate and efficient detection. Comprehensive experiments on a real-world agricultural greenhouse dataset demonstrate the feasibility of the approach. Experimental results show that the proposal is practical for addressing the challenges of traditional anomaly detection while ensuring accuracy and efficiency.
Abstract: In this paper, we consider the problem of evaluating system reliability using statistical data obtained from reliability tests of its elements, in which the lifetimes of the elements are described by an exponential distribution. We assume that the lifetime data may be reported imprecisely and that this lack of precision may be described using fuzzy sets. As the direct application of the fuzzy sets methodology leads in this case to very complicated and time-consuming calculations, we propose simple approximations of fuzzy numbers using shadowed sets introduced by Pedrycz (1998). The proposed methodology may be readily extended to the case of general lifetime probability distributions.
Funding: Supported by the Fundamental Research Funds for the Central Universities (No. 2018XD004).
Abstract: The distance-based outlier detection method detects implied outliers by calculating the distances between points in the dataset, but its computational complexity is particularly high when processing multidimensional datasets. In addition, the traditional outlier detection method does not consider how frequently subsets occur, so the detected outliers do not fit the definition of outliers (i.e., rarely appearing). Pattern mining-based outlier detection approaches have solved this problem, but they do not take the importance of each pattern into account during outlier detection, so the detected outliers cannot truly reflect the actual situation. To address these problems, a two-phase minimal weighted rare pattern mining-based outlier detection approach, called MWRPM-Outlier, is proposed to effectively detect outliers on a weighted data stream. In particular, a method called MWRPM is proposed in the pattern mining phase to quickly mine the minimal weighted rare patterns, and two deviation factors are then defined in the outlier detection phase to measure the abnormal degree of each transaction on the weighted data stream. Experimental results show that the proposed MWRPM-Outlier approach performs very well in outlier detection and that the MWRPM approach performs better in weighted rare pattern mining.
Abstract: Clustering is an unsupervised learning problem: a procedure that partitions data objects into groups. Many algorithms cannot overcome the problems of morphology, overlapping, and a large number of clusters at the same time. Many scientific communities have approached clustering from the perspective of density, which is one of the best methods in clustering. This study proposes a density-based spatial clustering of applications with noise (DBSCAN) algorithm based on high-density areas selected by automatic fuzzy-DBSCAN (AFD), which works with the initialization of two parameters. AFD, using fuzzy and DBSCAN features, is modeled by the selection of high-density areas and automatically generates two parameters for merging and separating. The two generated parameters provide a state of sub-cluster rules in the Cartesian coordinate system for the dataset. The model simultaneously overcomes clustering problems such as morphology, overlapping, and the number of clusters in a dataset. In the experiments, all algorithms are run 30 times on eight datasets. Three of them are overlapping real datasets, and the rest are morphologic and synthetic datasets. The results demonstrate that the AFD algorithm outperforms other recently developed clustering algorithms.
Funding: This work is supported, in part, by the National Natural Science Foundation of China under grant number 61300026 and, in part, by the Natural Science Foundation of Fujian Province under grant numbers 2017J01754 and 2018J01797.
Abstract: Continuous responses to range queries on streaming data provide useful information for many practical applications, but also carry the risk of privacy disclosure. Existing research on differentially private streaming data publication mostly focuses on boosting query accuracy, pays less attention to query efficiency, and ignores the effect of timeliness on data weight. In this paper, we propose an effective algorithm for differentially private streaming data publication under the exponential decay mode. First, by introducing the Fenwick tree to divide and reorganize the data items in the stream, we achieve constant time complexity for inserting a new item and getting the prefix sum. Meanwhile, building a tree takes time linear in the number of data items. After that, we use the matrix mechanism to handle correlated queries and reduce the global sensitivity. In addition, we choose a proper diagonal matrix to further improve the range query accuracy. Finally, to account for exponential decay, every data item is weighted by the decay factor. By combining the Fenwick tree and matrix optimization, we present a complete algorithm for differentially private real-time streaming data publication. The experiment compares the proposed algorithm with similar algorithms for streaming data release under exponential decay. Experimental results show that the proposed algorithm effectively improves query efficiency while ensuring the quality of the query.
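As a rough illustration of how a Fenwick tree supports range queries over decayed stream items, the sketch below uses the textbook structure, whose updates and prefix sums take logarithmic time; the constant-time variant, the matrix mechanism, and the privacy noise described in the abstract are not reproduced. The `decay` parameter and all names are hypothetical.

```python
class FenwickTree:
    """Textbook Fenwick (binary indexed) tree over a fixed number of slots."""

    def __init__(self, size):
        self.size = size
        self.tree = [0.0] * (size + 1)   # 1-based internal indexing

    def add(self, index, value):
        """Add `value` to the item at 0-based position `index`."""
        i = index + 1
        while i <= self.size:
            self.tree[i] += value
            i += i & -i                  # move to the next covering node

    def prefix_sum(self, count):
        """Sum of the first `count` items."""
        total = 0.0
        i = count
        while i > 0:
            total += self.tree[i]
            i -= i & -i                  # strip the lowest set bit
        return total

    def range_sum(self, lo, hi):
        """Sum of items at 0-based positions lo..hi, inclusive."""
        return self.prefix_sum(hi + 1) - self.prefix_sum(lo)


def decayed_range_sum(values, lo, hi, decay):
    """Range sum where the item at time t is weighted by decay ** (now - t)."""
    now = len(values) - 1
    tree = FenwickTree(len(values))
    for t, x in enumerate(values):
        tree.add(t, x * decay ** (now - t))
    return tree.range_sum(lo, hi)


recent_weighted = decayed_range_sum([1, 2, 3, 4], lo=1, hi=3, decay=0.5)
```

Here the newest item keeps its full weight and each step back in time halves it, so older items contribute progressively less to the range query.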
Funding: Supported by the National Natural Science Foundation of China (60403027).
Abstract: This paper presents two one-pass algorithms for dynamically computing frequency counts in a sliding window over a data stream, that is, computing the frequency counts that exceed a user-specified threshold ε. The first algorithm constructs sub-windows and periodically deletes expired sub-windows in the sliding window, and each sub-window maintains a summary data structure. The first algorithm outputs at most 1/ε + 1 elements for frequency queries over the most recent N elements. The second algorithm adapts a multiple-level method to deal with the data stream. Once the sketch of the most recent N elements has been constructed, the second algorithm can provide the answers to frequency queries over the most recent n (n ≤ N) elements. The second algorithm outputs at most 1/ε + 2 elements. The analytical and experimental results show that the algorithms are accurate and effective.
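The sub-window idea behind the first algorithm can be sketched as follows. This is a simplified illustration under stated assumptions: each sub-window keeps exact counts in a `Counter` rather than the paper's compact summary structure, so the output bound quoted in the abstract is not reproduced, and the class and parameter names are hypothetical.

```python
from collections import Counter, deque


class SubWindowCounter:
    """Frequency counts over roughly the last `window` items, using sub-windows."""

    def __init__(self, window, sub_size):
        self.window = window
        self.sub_size = sub_size
        self.subs = deque()   # each entry: [Counter of items, number of items]
        self.total = 0        # items currently covered by all sub-windows

    def add(self, item):
        # Open a new sub-window when the newest one is full.
        if not self.subs or self.subs[-1][1] == self.sub_size:
            self.subs.append([Counter(), 0])
        self.subs[-1][0][item] += 1
        self.subs[-1][1] += 1
        self.total += 1
        # Delete expired sub-windows: the oldest one can go once the
        # remaining sub-windows still cover at least `window` items.
        while self.total - self.subs[0][1] >= self.window:
            self.total -= self.subs.popleft()[1]

    def frequent(self, eps):
        """Items whose count in the covered range exceeds eps * window."""
        merged = Counter()
        for counts, _ in self.subs:
            merged.update(counts)
        return {item: c for item, c in merged.items() if c > eps * self.window}


counter = SubWindowCounter(window=4, sub_size=2)
for item in "aaaabbbb":
    counter.add(item)
heavy = counter.frequent(eps=0.5)
```

Because whole sub-windows expire at once, the covered range can exceed the window by up to `sub_size - 1` items; a smaller `sub_size` tightens this at the cost of more sub-windows to maintain.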
Funding: The National Natural Science Foundation of China (60503024, 50634010).
Abstract: One of the goals of data collection is to prepare for decision-making, so high quality requirements must be satisfied. Rational evaluation of data quality is an effective way to identify data problems in time, so that the quality of the evaluated data satisfies the requirements of the decision maker. A research method for data quality evaluation based on a fuzzy neural network is proposed. First, the criteria for evaluating data quality are selected to construct the fuzzy sets of evaluation grades; then, using the learning ability of the neural network (NN), an objective evaluation of membership is carried out, which can be used for the effective evaluation of data quality. This research has been used in the platform of 'data report of national compulsory education outlay guarantee' from the Chinese Ministry of Education. The method can be used for the effective evaluation of data quality in general, and it reveals the data quality situation more completely, more objectively, and in a more timely manner.
Funding: Supported by the National High Technology Research and Development Programme of China (No. 2011AA120300, 2011AA120302); the National Key Technology Support Program of China (No. 2013BAH66F02).
Abstract: Data aggregation from various web sources is very significant for the web data analysis domain. In addition, the recognition of coherence micro-clusters is one of the most interesting issues in the field of data aggregation. Until now, many algorithms have been proposed to work on this issue. However, the deficiency of these solutions is that they cannot recognize the micro-cluster data stream accurately. A semantic-based coherent micro-cluster recognition algorithm for hybrid web data streams is proposed. Firstly, an objective function is proposed to recognize the coherence micro-cluster, and then the semantic-based coherence micro-cluster recognition algorithm for hybrid web data streams is presented. Fi-