Concept drift is a main security issue that has to be resolved since it presents a significant barrier to the deployment of machine learning(ML)models.Due to attackers’(and/or benign equivalents’)dynamic behavior ch...Concept drift is a main security issue that has to be resolved since it presents a significant barrier to the deployment of machine learning(ML)models.Due to attackers’(and/or benign equivalents’)dynamic behavior changes,testing data distribution frequently diverges from original training data over time,resulting in substantial model failures.Due to their dispersed and dynamic nature,distributed denial-of-service attacks pose a danger to cybersecurity,resulting in attacks with serious consequences for users and businesses.This paper proposes a novel design for concept drift analysis and detection of malware attacks like Distributed Denial of Service(DDOS)in the network.The goal of this architecture combination is to accurately represent data and create an effective cyber security prediction agent.The intrusion detection system and concept drift of the network has been analyzed using secure adaptive windowing with website data authentication protocol(SAW_WDA).The network has been analyzed by authentication protocol to avoid malware attacks.The data of network users will be collected and classified using multilayer perceptron gradient decision tree(MLPGDT)classifiers.Based on the classification output,the decision for the detection of attackers and authorized users will be identified.The experimental results show output based on intrusion detection and concept drift analysis systems in terms of throughput,end-end delay,network security,network concept drift,and results based on classification with regard to accuracy,memory,and precision and F-1 score.展开更多
Every application in a smart city environment like the smart grid,health monitoring, security, and surveillance generates non-stationary datastreams. Due to such nature, the statistical properties of data changes over...Every application in a smart city environment like the smart grid,health monitoring, security, and surveillance generates non-stationary datastreams. Due to such nature, the statistical properties of data changes overtime, leading to class imbalance and concept drift issues. Both these issuescause model performance degradation. Most of the current work has beenfocused on developing an ensemble strategy by training a new classifier on thelatest data to resolve the issue. These techniques suffer while training the newclassifier if the data is imbalanced. Also, the class imbalance ratio may changegreatly from one input stream to another, making the problem more complex.The existing solutions proposed for addressing the combined issue of classimbalance and concept drift are lacking in understating of correlation of oneproblem with the other. This work studies the association between conceptdrift and class imbalance ratio and then demonstrates how changes in classimbalance ratio along with concept drift affect the classifier’s performance.We analyzed the effect of both the issues on minority and majority classesindividually. To do this, we conducted experiments on benchmark datasetsusing state-of-the-art classifiers especially designed for data stream classification.Precision, recall, F1 score, and geometric mean were used to measure theperformance. Our findings show that when both class imbalance and conceptdrift problems occur together the performance can decrease up to 15%. Ourresults also show that the increase in the imbalance ratio can cause a 10% to15% decrease in the precision scores of both minority and majority classes.The study findings may help in designing intelligent and adaptive solutionsthat can cope with the challenges of non-stationary data streams like conceptdrift and class imbalance.展开更多
Big data streams started becoming ubiquitous in recent years,thanks to rapid generation of massive volumes of data by different applications.It is challenging to apply existing data mining tools and techniques directl...Big data streams started becoming ubiquitous in recent years,thanks to rapid generation of massive volumes of data by different applications.It is challenging to apply existing data mining tools and techniques directly in these big data streams.At the same time,streaming data from several applications results in two major problems such as class imbalance and concept drift.The current research paper presents a new Multi-Objective Metaheuristic Optimization-based Big Data Analytics with Concept Drift Detection(MOMBD-CDD)method on High-Dimensional Streaming Data.The presented MOMBD-CDD model has different operational stages such as pre-processing,CDD,and classification.MOMBD-CDD model overcomes class imbalance problem by Synthetic Minority Over-sampling Technique(SMOTE).In order to determine the oversampling rates and neighboring point values of SMOTE,Glowworm Swarm Optimization(GSO)algorithm is employed.Besides,Statistical Test of Equal Proportions(STEPD),a CDD technique is also utilized.Finally,Bidirectional Long Short-Term Memory(Bi-LSTM)model is applied for classification.In order to improve classification performance and to compute the optimum parameters for Bi-LSTM model,GSO-based hyperparameter tuning process is carried out.The performance of the presented model was evaluated using high dimensional benchmark streaming datasets namely intrusion detection(NSL KDDCup)dataset and ECUE spam dataset.An extensive experimental validation process confirmed the effective outcome of MOMBD-CDD model.The proposed model attained high accuracy of 97.45%and 94.23%on the applied KDDCup99 Dataset and ECUE Spam datasets respectively.展开更多
One recent area of interest in computer science is data stream management and processing. By ‘data stream', we refer to continuous and rapidly generated packages of data. Specific features of data streams are imm...One recent area of interest in computer science is data stream management and processing. By ‘data stream', we refer to continuous and rapidly generated packages of data. Specific features of data streams are immense volume, high production rate, limited data processing time, and data concept drift; these features differentiate the data stream from standard types of data. An issue for the data stream is classification of input data. A novel ensemble classifier is proposed in this paper. The classifier uses base classifiers of two weighting functions under different data input conditions. In addition, a new method is used to determine drift, which emphasizes the precision of the algorithm. Another characteristic of the proposed method is removal of different numbers of the base classifiers based on their quality. Implementation of a weighting mechanism to the base classifiers at the decision-making stage is another advantage of the algorithm. This facilitates adaptability when drifts take place, which leads to classifiers with higher efficiency. Furthermore, the proposed method is tested on a set of standard data and the results confirm higher accuracy compared to available ensemble classifiers and single classifiers. In addition, in some cases the proposed classifier is faster and needs less storage space.展开更多
Textual data streams have been extensively used in practical applications where consumers of online products have expressed their views regarding online products.Due to changes in data distribution,commonly referred t...Textual data streams have been extensively used in practical applications where consumers of online products have expressed their views regarding online products.Due to changes in data distribution,commonly referred to as concept drift,mining this data stream is a challenging problem for researchers.The majority of the existing drift detection techniques are based on classification errors,which have higher probabilities of false-positive or missed detections.To improve classification accuracy,there is a need to develop more intuitive detection techniques that can identify a great number of drifts in the data streams.This paper presents an adaptive unsupervised learning technique,an ensemble classifier based on drift detection for opinion mining and sentiment classification.To improve classification performance,this approach uses four different dissimilarity measures to determine the degree of concept drifts in the data stream.Whenever a drift is detected,the proposed method builds and adds a new classifier to the ensemble.To add a new classifier,the total number of classifiers in the ensemble is first checked if the limit is exceeded before the classifier with the least weight is removed from the ensemble.To this end,a weighting mechanism is used to calculate the weight of each classifier,which decides the contribution of each classifier in the final classification results.Several experiments were conducted on real-world datasets and the resultswere evaluated on the false positive rate,miss detection rate,and accuracy measures.The proposed method is also compared with the state-of-the-art methods,which include DDM,EDDM,and PageHinkley with support vector machine(SVM)and Naive Bayes classifiers that are frequently used in concept drift detection studies.In all cases,the results show the efficiency of our proposed method.展开更多
概念漂移是流数据挖掘领域中的一个重要且具有挑战性的难题.然而,目前的方法大多仅能够处理线性或简单的非线性映射,深度神经网络虽然有较强的非线性拟合能力,但在流数据挖掘任务中,每次只能在新得到的1个或一批样本上进行训练,学习模...概念漂移是流数据挖掘领域中的一个重要且具有挑战性的难题.然而,目前的方法大多仅能够处理线性或简单的非线性映射,深度神经网络虽然有较强的非线性拟合能力,但在流数据挖掘任务中,每次只能在新得到的1个或一批样本上进行训练,学习模型难以实时调整以适应动态变化的数据流.为解决上述问题,将梯度提升算法的纠错思想引入含概念漂移的流数据挖掘任务之中,提出了一种基于自适应深度集成网络的概念漂移收敛方法(concept drift convergence method based on adaptive deep ensemble networks,CD_ADEN).该模型集成多个浅层神经网络作为基学习器,后序基学习器在前序基学习器输出的基础上不断纠错,具有较高的实时泛化性能.此外,由于浅层神经网络有较快的收敛速度,因此所提出的模型能够较快地从概念漂移造成的精度下降中恢复.多个数据集上的实验结果表明,所提出的CD_ADEN方法平均实时精度有明显提高,相较于对比方法,平均实时精度有1%~5%的提升,且平均序值在7种典型的对比算法中排名第一.说明所提出的方法能够对前序输出进行纠错,且学习模型能够快速地从概念漂移造成的精度下降中恢复,提升了在线学习模型的实时泛化性能.展开更多
大数据时代,越来越多的数据以数据流的形式产生,由于其具有快速、无限、不稳定及动态变化等特性,使得概念漂移成为流数据挖掘中一个重要但困难的问题.目前多数概念漂移处理方法存在信息提取能力有限且未充分考虑流数据的时序特性等问题...大数据时代,越来越多的数据以数据流的形式产生,由于其具有快速、无限、不稳定及动态变化等特性,使得概念漂移成为流数据挖掘中一个重要但困难的问题.目前多数概念漂移处理方法存在信息提取能力有限且未充分考虑流数据的时序特性等问题.针对这些问题,提出一种基于混合特征提取的流数据概念漂移处理方法(concept drift processing method of streaming data based on mixed feature extraction,MFECD).该方法首先采用不同尺度的卷积核对数据进行建模以构建拼接特征,采用门控机制将浅层输入和拼接特征融合,作为不同网络层次输入进行自适应集成,以获得能够兼顾细节信息和语义信息的数据特性.在此基础上,采用注意力机制和相似度计算评估流数据不同时刻的重要性,以增强数据流关键位点的时序特性.实验结果表明,该方法能有效提取流数据中包含的复杂数据特征和时序特征,提高了数据流中概念漂移的处理能力.展开更多
We applied the decision tree algorithm to learn association rules between webpage’s category(pornographic or normal) and the critical features.Based on these rules, we proposed an efficient method of filtering pornog...We applied the decision tree algorithm to learn association rules between webpage’s category(pornographic or normal) and the critical features.Based on these rules, we proposed an efficient method of filtering pornographic webpages with the following major advantages: 1) a weighted window-based technique was proposed to estimate for the condition of concept drift for the keywords found recently in pornographic webpages; 2) checking only contexts of webpages without scanning pictures; 3) an incremental learning mechanism was designed to incrementally update the pornographic keyword database.展开更多
为了快速适应非平稳环境中工业数据流的分布变化,需要在非结构化和噪声干扰的数据中准确、实时的完成概念漂移的检测.本文提出了一种基于多元区域集划分的工业数据流概念漂移检测算法(Concept Drift detection-Multivariate region set ...为了快速适应非平稳环境中工业数据流的分布变化,需要在非结构化和噪声干扰的数据中准确、实时的完成概念漂移的检测.本文提出了一种基于多元区域集划分的工业数据流概念漂移检测算法(Concept Drift detection-Multivariate region set Partition,CDMP).首先基于实例模糊密度进行多元区域集划分,根据划分的若干模糊分区集合,识别概念漂移发生的区域.概念漂移的持续发生会显著降低基于多元区域集构建的模型的分类性能,CDMP通过构建多元历史模型池来保留具有多样性的历史模型,以降低模型调整或再训练造成的性能损耗,同时保证概念漂移检测中准确性.CDMP在不同数据集上进行了性能测试.实验结果表明,CDMP实现了对历史模型多样性的保留和重用,能够在不同噪声水平的工业物联网环境中实现对重现型、突发型等多类型概念漂移的准确检测.展开更多
针对流数据中概念漂移发生后,在线学习模型不能对分布变化后的数据做出及时响应且难以提取数据分布的最新信息,导致学习模型收敛较慢的问题,提出一种基于在线集成的概念漂移自适应分类方法(adaptive classification method for concept ...针对流数据中概念漂移发生后,在线学习模型不能对分布变化后的数据做出及时响应且难以提取数据分布的最新信息,导致学习模型收敛较慢的问题,提出一种基于在线集成的概念漂移自适应分类方法(adaptive classification method for concept drift based on online ensemble,AC_OE).一方面,该方法利用在线集成策略构建在线集成学习器,对数据块中的训练样本进行局部预测以动态调整学习器权重,有助于深入提取漂移位点附近流数据的演化信息,对数据分布变化进行精准响应,提升在线学习模型对概念漂移发生后新数据分布的适应能力,提高学习模型的实时泛化性能;另一方面,利用增量学习策略构建增量学习器,并随新样本的进入进行增量式的训练更新,提取流数据的全局分布信息,使模型在平稳的流数据状态下保持较好的鲁棒性.实验结果表明,该方法能够对概念漂移做出及时响应并加速在线学习模型的收敛速度,同时有效提高学习器的整体泛化性能.展开更多
基金The Taif University Deanship of Scientific Research supported this endeavor(Project Number:1-443-4)for which the authors are grateful to Taif University for their kind support.
文摘Concept drift is a main security issue that has to be resolved since it presents a significant barrier to the deployment of machine learning(ML)models.Due to attackers’(and/or benign equivalents’)dynamic behavior changes,testing data distribution frequently diverges from original training data over time,resulting in substantial model failures.Due to their dispersed and dynamic nature,distributed denial-of-service attacks pose a danger to cybersecurity,resulting in attacks with serious consequences for users and businesses.This paper proposes a novel design for concept drift analysis and detection of malware attacks like Distributed Denial of Service(DDOS)in the network.The goal of this architecture combination is to accurately represent data and create an effective cyber security prediction agent.The intrusion detection system and concept drift of the network has been analyzed using secure adaptive windowing with website data authentication protocol(SAW_WDA).The network has been analyzed by authentication protocol to avoid malware attacks.The data of network users will be collected and classified using multilayer perceptron gradient decision tree(MLPGDT)classifiers.Based on the classification output,the decision for the detection of attackers and authorized users will be identified.The experimental results show output based on intrusion detection and concept drift analysis systems in terms of throughput,end-end delay,network security,network concept drift,and results based on classification with regard to accuracy,memory,and precision and F-1 score.
基金The authors would like to extend their gratitude to Universiti Teknologi PETRONAS (Malaysia)for funding this research through grant number (015LA0-037).
文摘Every application in a smart city environment like the smart grid,health monitoring, security, and surveillance generates non-stationary datastreams. Due to such nature, the statistical properties of data changes overtime, leading to class imbalance and concept drift issues. Both these issuescause model performance degradation. Most of the current work has beenfocused on developing an ensemble strategy by training a new classifier on thelatest data to resolve the issue. These techniques suffer while training the newclassifier if the data is imbalanced. Also, the class imbalance ratio may changegreatly from one input stream to another, making the problem more complex.The existing solutions proposed for addressing the combined issue of classimbalance and concept drift are lacking in understating of correlation of oneproblem with the other. This work studies the association between conceptdrift and class imbalance ratio and then demonstrates how changes in classimbalance ratio along with concept drift affect the classifier’s performance.We analyzed the effect of both the issues on minority and majority classesindividually. To do this, we conducted experiments on benchmark datasetsusing state-of-the-art classifiers especially designed for data stream classification.Precision, recall, F1 score, and geometric mean were used to measure theperformance. Our findings show that when both class imbalance and conceptdrift problems occur together the performance can decrease up to 15%. Ourresults also show that the increase in the imbalance ratio can cause a 10% to15% decrease in the precision scores of both minority and majority classes.The study findings may help in designing intelligent and adaptive solutionsthat can cope with the challenges of non-stationary data streams like conceptdrift and class imbalance.
文摘Big data streams started becoming ubiquitous in recent years,thanks to rapid generation of massive volumes of data by different applications.It is challenging to apply existing data mining tools and techniques directly in these big data streams.At the same time,streaming data from several applications results in two major problems such as class imbalance and concept drift.The current research paper presents a new Multi-Objective Metaheuristic Optimization-based Big Data Analytics with Concept Drift Detection(MOMBD-CDD)method on High-Dimensional Streaming Data.The presented MOMBD-CDD model has different operational stages such as pre-processing,CDD,and classification.MOMBD-CDD model overcomes class imbalance problem by Synthetic Minority Over-sampling Technique(SMOTE).In order to determine the oversampling rates and neighboring point values of SMOTE,Glowworm Swarm Optimization(GSO)algorithm is employed.Besides,Statistical Test of Equal Proportions(STEPD),a CDD technique is also utilized.Finally,Bidirectional Long Short-Term Memory(Bi-LSTM)model is applied for classification.In order to improve classification performance and to compute the optimum parameters for Bi-LSTM model,GSO-based hyperparameter tuning process is carried out.The performance of the presented model was evaluated using high dimensional benchmark streaming datasets namely intrusion detection(NSL KDDCup)dataset and ECUE spam dataset.An extensive experimental validation process confirmed the effective outcome of MOMBD-CDD model.The proposed model attained high accuracy of 97.45%and 94.23%on the applied KDDCup99 Dataset and ECUE Spam datasets respectively.
文摘One recent area of interest in computer science is data stream management and processing. By ‘data stream', we refer to continuous and rapidly generated packages of data. Specific features of data streams are immense volume, high production rate, limited data processing time, and data concept drift; these features differentiate the data stream from standard types of data. An issue for the data stream is classification of input data. A novel ensemble classifier is proposed in this paper. The classifier uses base classifiers of two weighting functions under different data input conditions. In addition, a new method is used to determine drift, which emphasizes the precision of the algorithm. Another characteristic of the proposed method is removal of different numbers of the base classifiers based on their quality. Implementation of a weighting mechanism to the base classifiers at the decision-making stage is another advantage of the algorithm. This facilitates adaptability when drifts take place, which leads to classifiers with higher efficiency. Furthermore, the proposed method is tested on a set of standard data and the results confirm higher accuracy compared to available ensemble classifiers and single classifiers. In addition, in some cases the proposed classifier is faster and needs less storage space.
基金The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work through Large Groups(Project under Grant Number(RGP.2/49/43)).
文摘Textual data streams have been extensively used in practical applications where consumers of online products have expressed their views regarding online products.Due to changes in data distribution,commonly referred to as concept drift,mining this data stream is a challenging problem for researchers.The majority of the existing drift detection techniques are based on classification errors,which have higher probabilities of false-positive or missed detections.To improve classification accuracy,there is a need to develop more intuitive detection techniques that can identify a great number of drifts in the data streams.This paper presents an adaptive unsupervised learning technique,an ensemble classifier based on drift detection for opinion mining and sentiment classification.To improve classification performance,this approach uses four different dissimilarity measures to determine the degree of concept drifts in the data stream.Whenever a drift is detected,the proposed method builds and adds a new classifier to the ensemble.To add a new classifier,the total number of classifiers in the ensemble is first checked if the limit is exceeded before the classifier with the least weight is removed from the ensemble.To this end,a weighting mechanism is used to calculate the weight of each classifier,which decides the contribution of each classifier in the final classification results.Several experiments were conducted on real-world datasets and the resultswere evaluated on the false positive rate,miss detection rate,and accuracy measures.The proposed method is also compared with the state-of-the-art methods,which include DDM,EDDM,and PageHinkley with support vector machine(SVM)and Naive Bayes classifiers that are frequently used in concept drift detection studies.In all cases,the results show the efficiency of our proposed method.
文摘概念漂移是流数据挖掘领域中的一个重要且具有挑战性的难题.然而,目前的方法大多仅能够处理线性或简单的非线性映射,深度神经网络虽然有较强的非线性拟合能力,但在流数据挖掘任务中,每次只能在新得到的1个或一批样本上进行训练,学习模型难以实时调整以适应动态变化的数据流.为解决上述问题,将梯度提升算法的纠错思想引入含概念漂移的流数据挖掘任务之中,提出了一种基于自适应深度集成网络的概念漂移收敛方法(concept drift convergence method based on adaptive deep ensemble networks,CD_ADEN).该模型集成多个浅层神经网络作为基学习器,后序基学习器在前序基学习器输出的基础上不断纠错,具有较高的实时泛化性能.此外,由于浅层神经网络有较快的收敛速度,因此所提出的模型能够较快地从概念漂移造成的精度下降中恢复.多个数据集上的实验结果表明,所提出的CD_ADEN方法平均实时精度有明显提高,相较于对比方法,平均实时精度有1%~5%的提升,且平均序值在7种典型的对比算法中排名第一.说明所提出的方法能够对前序输出进行纠错,且学习模型能够快速地从概念漂移造成的精度下降中恢复,提升了在线学习模型的实时泛化性能.
文摘大数据时代,越来越多的数据以数据流的形式产生,由于其具有快速、无限、不稳定及动态变化等特性,使得概念漂移成为流数据挖掘中一个重要但困难的问题.目前多数概念漂移处理方法存在信息提取能力有限且未充分考虑流数据的时序特性等问题.针对这些问题,提出一种基于混合特征提取的流数据概念漂移处理方法(concept drift processing method of streaming data based on mixed feature extraction,MFECD).该方法首先采用不同尺度的卷积核对数据进行建模以构建拼接特征,采用门控机制将浅层输入和拼接特征融合,作为不同网络层次输入进行自适应集成,以获得能够兼顾细节信息和语义信息的数据特性.在此基础上,采用注意力机制和相似度计算评估流数据不同时刻的重要性,以增强数据流关键位点的时序特性.实验结果表明,该方法能有效提取流数据中包含的复杂数据特征和时序特征,提高了数据流中概念漂移的处理能力.
基金supported by MOST under Grant No.MOST 103-2410-H-004-112
文摘We applied the decision tree algorithm to learn association rules between webpage’s category(pornographic or normal) and the critical features.Based on these rules, we proposed an efficient method of filtering pornographic webpages with the following major advantages: 1) a weighted window-based technique was proposed to estimate for the condition of concept drift for the keywords found recently in pornographic webpages; 2) checking only contexts of webpages without scanning pictures; 3) an incremental learning mechanism was designed to incrementally update the pornographic keyword database.
文摘为了快速适应非平稳环境中工业数据流的分布变化,需要在非结构化和噪声干扰的数据中准确、实时的完成概念漂移的检测.本文提出了一种基于多元区域集划分的工业数据流概念漂移检测算法(Concept Drift detection-Multivariate region set Partition,CDMP).首先基于实例模糊密度进行多元区域集划分,根据划分的若干模糊分区集合,识别概念漂移发生的区域.概念漂移的持续发生会显著降低基于多元区域集构建的模型的分类性能,CDMP通过构建多元历史模型池来保留具有多样性的历史模型,以降低模型调整或再训练造成的性能损耗,同时保证概念漂移检测中准确性.CDMP在不同数据集上进行了性能测试.实验结果表明,CDMP实现了对历史模型多样性的保留和重用,能够在不同噪声水平的工业物联网环境中实现对重现型、突发型等多类型概念漂移的准确检测.
文摘针对流数据中概念漂移发生后,在线学习模型不能对分布变化后的数据做出及时响应且难以提取数据分布的最新信息,导致学习模型收敛较慢的问题,提出一种基于在线集成的概念漂移自适应分类方法(adaptive classification method for concept drift based on online ensemble,AC_OE).一方面,该方法利用在线集成策略构建在线集成学习器,对数据块中的训练样本进行局部预测以动态调整学习器权重,有助于深入提取漂移位点附近流数据的演化信息,对数据分布变化进行精准响应,提升在线学习模型对概念漂移发生后新数据分布的适应能力,提高学习模型的实时泛化性能;另一方面,利用增量学习策略构建增量学习器,并随新样本的进入进行增量式的训练更新,提取流数据的全局分布信息,使模型在平稳的流数据状态下保持较好的鲁棒性.实验结果表明,该方法能够对概念漂移做出及时响应并加速在线学习模型的收敛速度,同时有效提高学习器的整体泛化性能.