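Abstract: Traffic classification is the foundation of, and key to, optimizing network quality of service. Machine learning algorithms classify traffic using statistical flow features, which is important for identifying encrypted and proprietary protocol traffic. However, feature bias and class imbalance are two major challenges facing machine-learning-based traffic classification. Feature bias means that some statistical flow features raise the identification accuracy for some applications while lowering it for others. Class imbalance means that machine learning traffic classifiers identify applications with few samples less accurately. To address these problems, a traffic classification framework based on ensemble clustering (TCFEC) is proposed. TCFEC consists of multiple base classifiers, each built by clustering in a different feature subspace, and an optimal decision component, and it improves traffic classification accuracy. Specifically, compared with traditional machine learning traffic classifiers, TCFEC improves average flow accuracy by up to 5% and byte accuracy by up to 6%.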
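The architecture described above (base classifiers clustering in different feature subspaces, plus a decision component) might be sketched roughly as follows. This is an illustration under stated assumptions, not the TCFEC implementation: the abstract does not say which clustering algorithm or decision rule is used, so a small deterministic k-means and a plain majority vote stand in for them. The feature layout (mean packet size, duration, packet count), the toy flows, and all function and class names are hypothetical.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def project(row, dims):
    """Keep only the features of `row` listed in `dims` (one feature subspace)."""
    return tuple(row[d] for d in dims)

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

def kmeans(points, k, iters=20):
    """Tiny k-means; seeds with the first k distinct points so runs are repeatable."""
    centroids = []
    for p in points:
        if p not in centroids:
            centroids.append(p)
        if len(centroids) == k:
            break
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

class SubspaceClusterClassifier:
    """Base classifier: cluster in one feature subspace, label clusters by majority."""

    def __init__(self, dims, k):
        self.dims, self.k = dims, k

    def fit(self, rows, labels):
        points = [project(r, self.dims) for r in rows]
        self.centroids = kmeans(points, self.k)
        votes = [Counter() for _ in self.centroids]
        for p, label in zip(points, labels):
            votes[nearest(p, self.centroids)][label] += 1
        self.cluster_labels = [v.most_common(1)[0][0] if v else None for v in votes]
        return self

    def predict(self, row):
        return self.cluster_labels[nearest(project(row, self.dims), self.centroids)]

def ensemble_predict(classifiers, row):
    """Assumed decision rule: majority vote over the base classifiers."""
    return Counter(c.predict(row) for c in classifiers).most_common(1)[0][0]

# Hypothetical flows: (mean packet size in bytes, duration in s, packet count).
rows = [(100, 1.0, 10), (110, 1.2, 12), (90, 0.9, 9),
        (1000, 30.0, 500), (1100, 28.0, 520), (950, 33.0, 480)]
labels = ["web", "web", "web", "p2p", "p2p", "p2p"]

subspaces = [(0, 1), (1, 2), (0, 2)]
classifiers = [SubspaceClusterClassifier(dims, k=2).fit(rows, labels) for dims in subspaces]

print(ensemble_predict(classifiers, (105, 1.1, 11)))
print(ensemble_predict(classifiers, (1050, 29.0, 510)))
```

Because each base classifier sees only one subspace, a feature that misleads on some applications affects only the voters that include it. That is the intuition behind countering feature bias with an ensemble, though the paper's optimal decision component is presumably more refined than a vote.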
Funding: The National Natural Science Foundation of China (No. 60673060); the Natural Science Foundation of Jiangsu Province (No. BK2005047)
Abstract: A new algorithm for clustering multiple data streams is proposed. The algorithm can effectively cluster data streams that show similar behavior with some unknown time delays. It uses the autoregressive (AR) modeling technique to measure correlations between data streams, and it exploits estimated frequency spectra to extract the essential features of streams. Each stream is represented as the sum of spectral components, and the correlation is measured component-wise. Each spectral component is described by four parameters, namely amplitude, phase, damping rate and frequency. The ε-lag-correlation between two spectral components is calculated, and the algorithm uses this information as the similarity measure in clustering data streams. Based on a sliding window model, the algorithm can continuously report the most recent clustering results and adjust the number of clusters. Experiments on real and synthetic streams show that the proposed clustering method has a higher speed and clustering quality than other similar methods.
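To make the idea of similarity under an unknown time delay concrete, here is a minimal sketch that searches for the lag maximizing the Pearson correlation between two windows. It is a deliberately simpler, time-domain stand-in: it does not fit AR models or compare spectral components, and it is not the ε-lag-correlation defined in the paper. The function names, the `max_lag` parameter, and the sine-wave test streams are assumptions made for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def best_lag_correlation(x, y, max_lag):
    """Return (lag, r) with the highest correlation for |lag| <= max_lag.

    A positive lag means y trails x by `lag` samples, so x[t] is paired with y[t + lag].
    """
    best = (0, pearson(x, y))
    for lag in range(1, max_lag + 1):
        candidates = (
            (lag, pearson(x[:-lag], y[lag:])),    # y delayed relative to x
            (-lag, pearson(x[lag:], y[:-lag])),   # x delayed relative to y
        )
        for candidate in candidates:
            if candidate[1] > best[1]:
                best = candidate
    return best

# Two synthetic streams in one window: y is x delayed by 5 samples.
x = [math.sin(0.3 * t) for t in range(60)]
y = [math.sin(0.3 * (t - 5)) for t in range(60)]

lag, r = best_lag_correlation(x, y, max_lag=10)
print(lag, round(r, 6))
```

A clustering step could then use `1 - r` at the best lag as a distance between streams and recompute it as the sliding window advances. The exhaustive lag search is what makes this naive version costly, and the abstract's spectral-component representation may be a way to avoid that cost.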
Funding: Supported by the National Natural Science Foundation of China under Grants No. 61170286 and No. 61202486
Abstract: P2P traffic has been a dominant portion of Internet traffic since its emergence in the late 1990s. Accurately classifying P2P traffic remains a key problem for Internet Service Providers (ISPs) and network managers. This paper proposes a novel approach to the accurate classification of P2P traffic at a fine-grained level, which depends solely on the number of special flows during small time intervals. These special flows, named Clustering Flows (CFs), are defined as the most frequent and steady flows generated by P2P applications. Hence we are able to classify P2P applications by detecting the appearance of the corresponding CFs. Compared to existing approaches, our classifier can realise high classification accuracy by exploiting only several generic properties of flows, instead of extracting sophisticated features from host behaviours or transport layer data. We validate our framework on a large set of P2P traffic traces using a Support Vector Machine (SVM). Experimental results show that our approach correctly classifies P2P applications with an average true positive rate of above 98% and a negligible false positive rate of about 0.01%.
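Counting "frequent and steady" flows per time interval could look roughly like the sketch below. The abstract does not say which generic flow properties identify a CF or how steadiness is measured, so everything here is an assumption: the flow signature is a (protocol, packet size) pair, and a signature counts as steady when it reaches `min_count` flows in at least a `min_presence` fraction of the observed intervals. The SVM stage is omitted.

```python
from collections import Counter, defaultdict

def interval_counts(flows, interval):
    """Bucket flows by time interval.

    flows: iterable of (timestamp_seconds, signature) pairs.
    Returns {interval_index: Counter(signature -> number of flows)}.
    """
    buckets = defaultdict(Counter)
    for timestamp, signature in flows:
        buckets[int(timestamp // interval)][signature] += 1
    return buckets

def steady_signatures(buckets, min_count, min_presence):
    """Signatures with >= min_count flows in >= min_presence of the intervals.

    Only intervals that contain at least one flow are counted in the denominator.
    """
    hits = Counter()
    for counts in buckets.values():
        for signature, n in counts.items():
            if n >= min_count:
                hits[signature] += 1
    total = len(buckets)
    return {sig for sig, h in hits.items() if h / total >= min_presence}

# Hypothetical trace: (timestamp in seconds, (protocol, packet size in bytes)).
flows = [
    (0.5, ("UDP", 68)), (1.0, ("UDP", 68)), (2.0, ("TCP", 1460)),
    (5.2, ("UDP", 68)), (6.1, ("UDP", 68)),
    (10.3, ("UDP", 68)), (11.0, ("UDP", 68)), (12.5, ("TCP", 40)),
]

buckets = interval_counts(flows, interval=5)
candidates = steady_signatures(buckets, min_count=2, min_presence=1.0)
print(candidates)
```

In a pipeline like the one the abstract describes, the per-interval counts of such candidate signatures would form the feature vector passed to the classifier.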