Attacks such as APT usually hide communication data in massive legitimate network traffic, and mining structurally complex and latent relationships among flow-based network traffic to detect attacks has become the foc...Attacks such as APT usually hide communication data in massive legitimate network traffic, and mining structurally complex and latent relationships among flow-based network traffic to detect attacks has become the focus of many initiatives. Effectively analyzing massive network security data with high dimensions for suspicious flow diagnosis is a huge challenge. In addition, the uneven distribution of network traffic does not fully reflect the differences of class sample features, resulting in the low accuracy of attack detection. To solve these problems, a novel approach called the fuzzy entropy weighted natural nearest neighbor(FEW-NNN) method is proposed to enhance the accuracy and efficiency of flowbased network traffic attack detection. First, the FEW-NNN method uses the Fisher score and deep graph feature learning algorithm to remove unimportant features and reduce the data dimension. Then, according to the proposed natural nearest neighbor searching algorithm(NNN_Searching), the density of data points, each class center and the smallest enclosing sphere radius are determined correspondingly. Finally, a fuzzy entropy weighted KNN classification method based on affinity is proposed, which mainly includes the following three steps: 1、 the feature weights of samples are calculated based on fuzzy entropy values, 2、 the fuzzy memberships of samples are determined based on affinity among samples, and 3、 K-neighbors are selected according to the class-conditional weighted Euclidean distance, the fuzzy membership value of the testing sample is calculated based on the membership of k-neighbors, and then all testing samples are classified according to the fuzzy membership value of the samples belonging to each class;that is, the attack type is determined. The method has been applied to the problem of attack detection and validated based on the famous KDD99 and CICIDS-2017 datasets. From the experimental results shown in this paper, it is observed that the FEW-NNN method improves the accuracy and efficiency of flow-based network traffic attack detection.展开更多
在定位请求服务中,如何保护用户的位置隐私和位置服务提供商(Localization service provider,LSP)的数据隐私是关系到WiFi指纹定位应用的一个具有挑战性的问题。基于密文域的K-近邻(K-nearest neighbors,KNN)检索,本文提出了一种适用于...在定位请求服务中,如何保护用户的位置隐私和位置服务提供商(Localization service provider,LSP)的数据隐私是关系到WiFi指纹定位应用的一个具有挑战性的问题。基于密文域的K-近邻(K-nearest neighbors,KNN)检索,本文提出了一种适用于三方的定位隐私保护算法,能有效提升对LSP指纹信息隐私的保护强度并降低计算开销。服务器和用户分别完成对指纹信息和定位请求的加密,而第三方则基于加密指纹库和加密定位请求,在隐私状态下完成对用户的位置估计。所提算法把各参考点的位置信息随机嵌入指纹,可避免恶意用户获取各参考点的具体位置;进一步利用布隆滤波器在隐藏接入点信息的情况下,第三方可完成参考点的在线匹配,实现对用户隐私状态下的粗定位,可与定位算法结合降低计算开销。在公共数据集和实验室数据集中,对两种算法的安全、开销和定位性能进行了全面的评估。与同类加密算法比较,在不降低定位精度的情况下,进一步增强了对数据隐私的保护。展开更多
It is well known that the nonparametric estimation of the regression function is highly sensitive to the presence of even a small proportion of outliers in the data.To solve the problem of typical observations when th...It is well known that the nonparametric estimation of the regression function is highly sensitive to the presence of even a small proportion of outliers in the data.To solve the problem of typical observations when the covariates of the nonparametric component are functional,the robust estimates for the regression parameter and regression operator are introduced.The main propose of the paper is to consider data-driven methods of selecting the number of neighbors in order to make the proposed processes fully automatic.We use thek Nearest Neighbors procedure(kNN)to construct the kernel estimator of the proposed robust model.Under some regularity conditions,we state consistency results for kNN functional estimators,which are uniform in the number of neighbors(UINN).Furthermore,a simulation study and an empirical application to a real data analysis of octane gasoline predictions are carried out to illustrate the higher predictive performances and the usefulness of the kNN approach.展开更多
Winding is one of themost important components in power transformers.Ensuring the health state of the winding is of great importance to the stable operation of the power system.To efficiently and accurately diagnose t...Winding is one of themost important components in power transformers.Ensuring the health state of the winding is of great importance to the stable operation of the power system.To efficiently and accurately diagnose the disc space variation(DSV)fault degree of transformer winding,this paper presents a diagnostic method of winding fault based on the K-Nearest Neighbor(KNN)algorithmand the frequency response analysis(FRA)method.First,a laboratory winding model is used,and DSV faults with four different degrees are achieved by changing disc space of the discs in the winding.Then,a series of FRA tests are conducted to obtain the FRA results and set up the FRA dataset.Second,ten different numerical indices are utilized to obtain features of FRA curves of faulted winding.Third,the 10-fold cross-validation method is employed to determine the optimal k-value of KNN.In addition,to improve the accuracy of the KNN model,a comparative analysis is made between the accuracy of the KNN algorithm and k-value under four distance functions.After getting the most appropriate distance metric and kvalue,the fault classificationmodel based on theKNN and FRA is constructed and it is used to classify the degrees of DSV faults.The identification accuracy rate of the proposed model is up to 98.30%.Finally,the performance of the model is presented by comparing with the support vector machine(SVM),SVM optimized by the particle swarmoptimization(PSO-SVM)method,and randomforest(RF).The results show that the diagnosis accuracy of the proposed model is the highest and the model can be used to accurately diagnose the DSV fault degrees of the winding.展开更多
In this paper,Edgeworth expansion for the nearest neighbor\|kernel estimate and random weighting approximation of conditional density are given and the consistency and convergence rate are proved.
Compositional data, such as relative information, is a crucial aspect of machine learning and other related fields. It is typically recorded as closed data or sums to a constant, like 100%. The statistical linear mode...Compositional data, such as relative information, is a crucial aspect of machine learning and other related fields. It is typically recorded as closed data or sums to a constant, like 100%. The statistical linear model is the most used technique for identifying hidden relationships between underlying random variables of interest. However, data quality is a significant challenge in machine learning, especially when missing data is present. The linear regression model is a commonly used statistical modeling technique used in various applications to find relationships between variables of interest. When estimating linear regression parameters which are useful for things like future prediction and partial effects analysis of independent variables, maximum likelihood estimation (MLE) is the method of choice. However, many datasets contain missing observations, which can lead to costly and time-consuming data recovery. To address this issue, the expectation-maximization (EM) algorithm has been suggested as a solution for situations including missing data. The EM algorithm repeatedly finds the best estimates of parameters in statistical models that depend on variables or data that have not been observed. This is called maximum likelihood or maximum a posteriori (MAP). Using the present estimate as input, the expectation (E) step constructs a log-likelihood function. Finding the parameters that maximize the anticipated log-likelihood, as determined in the E step, is the job of the maximization (M) phase. This study looked at how well the EM algorithm worked on a made-up compositional dataset with missing observations. It used both the robust least square version and ordinary least square regression techniques. The efficacy of the EM algorithm was compared with two alternative imputation techniques, k-Nearest Neighbor (k-NN) and mean imputation (), in terms of Aitchison distances and covariance.展开更多
基金the Natural Science Foundation of China (No. 61802404, 61602470)the Strategic Priority Research Program (C) of the Chinese Academy of Sciences (No. XDC02040100)+3 种基金the Fundamental Research Funds for the Central Universities of the China University of Labor Relations (No. 20ZYJS017, 20XYJS003)the Key Research Program of the Beijing Municipal Science & Technology Commission (No. D181100000618003)partially the Key Laboratory of Network Assessment Technology,the Chinese Academy of Sciencesthe Beijing Key Laboratory of Network Security and Protection Technology
文摘Attacks such as APT usually hide communication data in massive legitimate network traffic, and mining structurally complex and latent relationships among flow-based network traffic to detect attacks has become the focus of many initiatives. Effectively analyzing massive network security data with high dimensions for suspicious flow diagnosis is a huge challenge. In addition, the uneven distribution of network traffic does not fully reflect the differences of class sample features, resulting in the low accuracy of attack detection. To solve these problems, a novel approach called the fuzzy entropy weighted natural nearest neighbor(FEW-NNN) method is proposed to enhance the accuracy and efficiency of flowbased network traffic attack detection. First, the FEW-NNN method uses the Fisher score and deep graph feature learning algorithm to remove unimportant features and reduce the data dimension. Then, according to the proposed natural nearest neighbor searching algorithm(NNN_Searching), the density of data points, each class center and the smallest enclosing sphere radius are determined correspondingly. Finally, a fuzzy entropy weighted KNN classification method based on affinity is proposed, which mainly includes the following three steps: 1、 the feature weights of samples are calculated based on fuzzy entropy values, 2、 the fuzzy memberships of samples are determined based on affinity among samples, and 3、 K-neighbors are selected according to the class-conditional weighted Euclidean distance, the fuzzy membership value of the testing sample is calculated based on the membership of k-neighbors, and then all testing samples are classified according to the fuzzy membership value of the samples belonging to each class;that is, the attack type is determined. The method has been applied to the problem of attack detection and validated based on the famous KDD99 and CICIDS-2017 datasets. From the experimental results shown in this paper, it is observed that the FEW-NNN method improves the accuracy and efficiency of flow-based network traffic attack detection.
文摘在定位请求服务中,如何保护用户的位置隐私和位置服务提供商(Localization service provider,LSP)的数据隐私是关系到WiFi指纹定位应用的一个具有挑战性的问题。基于密文域的K-近邻(K-nearest neighbors,KNN)检索,本文提出了一种适用于三方的定位隐私保护算法,能有效提升对LSP指纹信息隐私的保护强度并降低计算开销。服务器和用户分别完成对指纹信息和定位请求的加密,而第三方则基于加密指纹库和加密定位请求,在隐私状态下完成对用户的位置估计。所提算法把各参考点的位置信息随机嵌入指纹,可避免恶意用户获取各参考点的具体位置;进一步利用布隆滤波器在隐藏接入点信息的情况下,第三方可完成参考点的在线匹配,实现对用户隐私状态下的粗定位,可与定位算法结合降低计算开销。在公共数据集和实验室数据集中,对两种算法的安全、开销和定位性能进行了全面的评估。与同类加密算法比较,在不降低定位精度的情况下,进一步增强了对数据隐私的保护。
文摘It is well known that the nonparametric estimation of the regression function is highly sensitive to the presence of even a small proportion of outliers in the data.To solve the problem of typical observations when the covariates of the nonparametric component are functional,the robust estimates for the regression parameter and regression operator are introduced.The main propose of the paper is to consider data-driven methods of selecting the number of neighbors in order to make the proposed processes fully automatic.We use thek Nearest Neighbors procedure(kNN)to construct the kernel estimator of the proposed robust model.Under some regularity conditions,we state consistency results for kNN functional estimators,which are uniform in the number of neighbors(UINN).Furthermore,a simulation study and an empirical application to a real data analysis of octane gasoline predictions are carried out to illustrate the higher predictive performances and the usefulness of the kNN approach.
基金supported in part by Shaanxi Natural Science Foundation Project (2023-JC-QN-0438)in part by Fundamental Research Funds for the Central Universities (2452021050).
文摘Winding is one of themost important components in power transformers.Ensuring the health state of the winding is of great importance to the stable operation of the power system.To efficiently and accurately diagnose the disc space variation(DSV)fault degree of transformer winding,this paper presents a diagnostic method of winding fault based on the K-Nearest Neighbor(KNN)algorithmand the frequency response analysis(FRA)method.First,a laboratory winding model is used,and DSV faults with four different degrees are achieved by changing disc space of the discs in the winding.Then,a series of FRA tests are conducted to obtain the FRA results and set up the FRA dataset.Second,ten different numerical indices are utilized to obtain features of FRA curves of faulted winding.Third,the 10-fold cross-validation method is employed to determine the optimal k-value of KNN.In addition,to improve the accuracy of the KNN model,a comparative analysis is made between the accuracy of the KNN algorithm and k-value under four distance functions.After getting the most appropriate distance metric and kvalue,the fault classificationmodel based on theKNN and FRA is constructed and it is used to classify the degrees of DSV faults.The identification accuracy rate of the proposed model is up to 98.30%.Finally,the performance of the model is presented by comparing with the support vector machine(SVM),SVM optimized by the particle swarmoptimization(PSO-SVM)method,and randomforest(RF).The results show that the diagnosis accuracy of the proposed model is the highest and the model can be used to accurately diagnose the DSV fault degrees of the winding.
文摘In this paper,Edgeworth expansion for the nearest neighbor\|kernel estimate and random weighting approximation of conditional density are given and the consistency and convergence rate are proved.
文摘Compositional data, such as relative information, is a crucial aspect of machine learning and other related fields. It is typically recorded as closed data or sums to a constant, like 100%. The statistical linear model is the most used technique for identifying hidden relationships between underlying random variables of interest. However, data quality is a significant challenge in machine learning, especially when missing data is present. The linear regression model is a commonly used statistical modeling technique used in various applications to find relationships between variables of interest. When estimating linear regression parameters which are useful for things like future prediction and partial effects analysis of independent variables, maximum likelihood estimation (MLE) is the method of choice. However, many datasets contain missing observations, which can lead to costly and time-consuming data recovery. To address this issue, the expectation-maximization (EM) algorithm has been suggested as a solution for situations including missing data. The EM algorithm repeatedly finds the best estimates of parameters in statistical models that depend on variables or data that have not been observed. This is called maximum likelihood or maximum a posteriori (MAP). Using the present estimate as input, the expectation (E) step constructs a log-likelihood function. Finding the parameters that maximize the anticipated log-likelihood, as determined in the E step, is the job of the maximization (M) phase. This study looked at how well the EM algorithm worked on a made-up compositional dataset with missing observations. It used both the robust least square version and ordinary least square regression techniques. The efficacy of the EM algorithm was compared with two alternative imputation techniques, k-Nearest Neighbor (k-NN) and mean imputation (), in terms of Aitchison distances and covariance.