Funding: Supported in part by the Shaanxi Natural Science Foundation Project (2023-JC-QN-0438) and in part by the Fundamental Research Funds for the Central Universities (2452021050).
Abstract: The winding is one of the most important components of a power transformer, and ensuring its health is of great importance to the stable operation of the power system. To efficiently and accurately diagnose the degree of disc space variation (DSV) faults in transformer windings, this paper presents a winding fault diagnosis method based on the K-Nearest Neighbor (KNN) algorithm and frequency response analysis (FRA). First, a laboratory winding model is used, and DSV faults of four different degrees are created by changing the spacing between discs in the winding. A series of FRA tests is then conducted to obtain the FRA results and build the FRA dataset. Second, ten different numerical indices are used to extract features from the FRA curves of the faulted winding. Third, 10-fold cross-validation is employed to determine the optimal k-value of KNN. In addition, to improve the accuracy of the KNN model, the relationship between accuracy and k-value is compared under four distance functions. Once the most appropriate distance metric and k-value have been selected, the fault classification model based on KNN and FRA is constructed and used to classify the degrees of DSV faults. The identification accuracy of the proposed model reaches up to 98.30%. Finally, the model's performance is compared with that of the support vector machine (SVM), SVM optimized by particle swarm optimization (PSO-SVM), and random forest (RF). The results show that the proposed model has the highest diagnostic accuracy and can be used to accurately diagnose the DSV fault degrees of the winding.
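The model-selection step described in this abstract (choosing the k-value and distance function by cross-validation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name its four distance functions, so the three metrics here are assumptions, and the two-dimensional synthetic clusters merely stand in for the ten FRA numerical indices. All function names are hypothetical.

```python
import math
import random
from collections import Counter

# Candidate distance functions. The abstract does not list its four metrics;
# these three are illustrative assumptions.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k, dist):
    """Majority vote among the k training points closest to the query."""
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def cv_accuracy(X, y, k, dist, folds=10, seed=0):
    """Accuracy of KNN pooled over all held-out samples of a k-fold split."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test = idx[f::folds]  # every folds-th shuffled index forms one fold
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        tX = [X[i] for i in train]
        ty = [y[i] for i in train]
        correct += sum(knn_predict(tX, ty, X[i], k, dist) == y[i] for i in test)
    return correct / len(X)

def select_model(X, y, k_values, metrics):
    """Return (accuracy, k, metric_name) with the highest CV accuracy."""
    results = [(cv_accuracy(X, y, k, fn), k, name)
               for name, fn in metrics.items() for k in k_values]
    return max(results, key=lambda r: r[0])

# Synthetic stand-in for the FRA feature vectors: four well-separated classes,
# one per fault degree.
rng = random.Random(42)
X, y = [], []
for label, centre in enumerate([(0, 0), (5, 0), (0, 5), (5, 5)]):
    for _ in range(25):
        X.append((centre[0] + rng.gauss(0, 0.5), centre[1] + rng.gauss(0, 0.5)))
        y.append(label)

METRICS = {"euclidean": euclidean, "manhattan": manhattan, "chebyshev": chebyshev}
best_acc, best_k, best_metric = select_model(X, y, [1, 3, 5, 7], METRICS)
print(best_acc, best_k, best_metric)
```

On real FRA features the classes overlap far more than in this toy data, so the accuracy curve over k would differ between metrics, which is the comparison the abstract reports.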
Abstract: Compositional data, which carry relative information, are a crucial aspect of machine learning and related fields. They are typically recorded as closed data, that is, parts that sum to a constant such as 100%. The statistical linear model is the most widely used technique for identifying hidden relationships between underlying random variables of interest. However, data quality is a significant challenge in machine learning, especially when data are missing. Linear regression is commonly used in a wide range of applications to find relationships between variables of interest. When estimating linear regression parameters, which are useful for tasks such as future prediction and partial effects analysis of independent variables, maximum likelihood estimation (MLE) is the method of choice. However, many datasets contain missing observations, and recovering the missing data can be costly and time-consuming. To address this issue, the expectation-maximization (EM) algorithm has been suggested as a solution for situations involving missing data. The EM algorithm iteratively finds maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that depend on unobserved variables or data. The expectation (E) step constructs the expected log-likelihood function using the current parameter estimate. The maximization (M) step then finds the parameters that maximize the expected log-likelihood determined in the E step. This study examined how well the EM algorithm performed on a simulated compositional dataset with missing observations, using both robust least squares and ordinary least squares regression techniques. The efficacy of the EM algorithm was compared with two alternative imputation techniques, k-Nearest Neighbor (k-NN) and mean imputation, in terms of Aitchison distances and covariance.
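The evaluation criterion named in this abstract, the Aitchison distance, can be illustrated with a short sketch. It uses the standard definition (Euclidean distance between centred log-ratio transforms) and a deliberately naive column-mean imputation followed by re-closure. The abstract does not say how its mean imputation was carried out, so that part is an assumption, as are the function names and the toy compositions.

```python
import math

def closure(parts, total=1.0):
    """Rescale positive parts so that they sum to `total`."""
    s = sum(parts)
    return [total * p / s for p in parts]

def clr(parts):
    """Centred log-ratio transform: log of each part over the geometric mean."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the clr-transformed compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

def mean_impute(rows):
    """Replace None with the column mean of observed values, then re-close.

    This is a naive baseline chosen for illustration only.
    """
    cols = list(zip(*rows))
    means = [sum(v for v in c if v is not None) / sum(v is not None for v in c)
             for c in cols]
    filled = [[means[j] if v is None else v for j, v in enumerate(r)]
              for r in rows]
    return [closure(r) for r in filled]

# Toy three-part compositions (made up), with one value hidden in the first row.
true_rows = [[0.2, 0.3, 0.5], [0.1, 0.4, 0.5], [0.3, 0.3, 0.4]]
observed = [[0.2, 0.3, None], [0.1, 0.4, 0.5], [0.3, 0.3, 0.4]]
imputed = mean_impute(observed)
error = aitchison_distance(true_rows[0], imputed[0])
print(error)
```

Because the distance is computed on log-ratios, it is unchanged when a composition is rescaled, which is why it suits closed data better than the plain Euclidean distance.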
Abstract: In this paper, an Edgeworth expansion for the nearest neighbor-kernel estimate and a random weighting approximation of the conditional density are given, and the consistency and convergence rate are proved.
Funding: Supported by the Natural Science Foundation of Jiangsu Province (No. BK2004142) and partly by the National Natural Science Foundation of China (No. 60275007).
Abstract: G-protein coupled receptors (GPCRs) are a class of seven-helix transmembrane proteins that are used in bioinformatics as targets to facilitate drug discovery for human diseases. Although thousands of GPCR sequences have been collected, the ligand specificity of many GPCRs is still unknown, and only one crystal structure of the rhodopsin-like family has been solved. Identifying GPCR types from sequence data alone has therefore become an important research issue. This study introduces a novel technique for identifying GPCR types based on the weighted Levenshtein distance between two receptor sequences and the nearest neighbor method (NNM), which can handle receptor sequences of different lengths directly. In our experiments on classifying four classes (acetylcholine, adrenoceptor, dopamine, and serotonin) of the rhodopsin-like family of GPCRs, the error rates from the leave-one-out procedure and the leave-half-out procedure were 0.62% and 1.24%, respectively. These results are superior to those of the covariant discriminant algorithm, the support vector machine method, and the NNM with Euclidean distance.
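The combination described in this abstract, an edit distance that compares sequences of different lengths plus a nearest neighbor rule, can be sketched as below. The abstract does not give the weighting scheme, so the insertion, deletion, and substitution costs here are placeholder assumptions, and the labelled strings are made-up toy sequences rather than real receptor sequences.

```python
def weighted_levenshtein(a, b, ins=1.0, dele=1.0, sub=1.0):
    """Minimum total cost of the edits that turn string a into string b."""
    prev = [j * ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * dele]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + dele,                            # delete ca
                curr[j - 1] + ins,                         # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub),  # match or substitute
            ))
        prev = curr
    return prev[-1]

def nearest_neighbor_label(query, labelled, **weights):
    """Label of the training sequence with the smallest edit distance."""
    _, label = min(labelled,
                   key=lambda pair: weighted_levenshtein(query, pair[0], **weights))
    return label

# Hypothetical toy data: (sequence, class label) pairs of unequal length.
labelled = [("ACDEFGHIK", "class_1"), ("LMNPQRSTVWY", "class_2")]
predicted = nearest_neighbor_label("ACDEFGIK", labelled)
print(predicted)  # class_1: one deletion away from the first sequence
```

No alignment or padding step is needed before classification, which is the practical advantage over a Euclidean distance on fixed-length feature vectors.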
Abstract: Consider the regression model $Y = X\beta + g(T) + e$. Here $g$ is an unknown smooth function on $[0, 1]$, $\beta$ is an l-dimensional parameter to be estimated, and $e$ is an unobserved error. When the data are randomly censored, the estimators $\beta_n^*$ and $g_n^*$ for $\beta$ and $g$ are obtained by using class K and the least squares methods. It is shown that $\beta_n^*$ is asymptotically normal and that $g_n^*$ achieves the convergence rate $O(n^{-1/3})$.
Funding: Supported by the National Natural Science Foundation of China (41805080); the Natural Science Foundation of Anhui Province, China (1708085QD89); the Key Research and Development Program Projects of Anhui Province, China (201904a07020099); and the Open Foundation Project of the Shenyang Institute of Atmospheric Environment, China Meteorological Administration (2016SYIAE14).
Abstract: This paper studies the application of a precipitation retrieval algorithm based on Himawari-8 (H8) satellite infrared data. From GPM precipitation data and H8 infrared channel brightness temperature data, a corresponding "precipitation field dictionary" and "channel brightness temperature dictionary" are formed. The retrieval of the precipitation field from brightness temperature data is carried out using the k-nearest neighbor (KNN) classification rule and a regularization constraint. First, the corresponding "dictionary" is constructed from the training sample database of matched GPM precipitation data and H8 brightness temperature data. Second, because small-scale organized precipitation features are often repeated across different storm environments, KNN is used to identify the spectral brightness temperature signals of "precipitation" and "non-precipitation" based on the "dictionary". Finally, the precipitation field is retrieved in the precipitation signal "subspace" using the regularization term constraint method. During retrieval, the contribution of each channel's brightness temperature retrieval is determined by a Bayesian model averaging (BMA) model. Preliminary experimental results based on quantitative evaluation indexes show that the precipitation retrieved from H8 correlates well with the GPM truth value, with a small error and a similar structure.
Funding: Project (2012CB725403) supported by the National Basic Research Program of China; Projects (71210001, 51338008) supported by the National Natural Science Foundation of China; Project supported by the World Capital Cities Smooth Traffic Collaborative Innovation Center and the Singapore National Research Foundation under its Campus for Research Excellence and Technology Enterprise (CREATE) Programme.
Abstract: Short-term traffic flow prediction is one of the essential issues in intelligent transportation systems (ITS). A new two-stage traffic flow prediction method, named the AKNN-AVL method, is presented. It combines an advanced k-nearest neighbor (AKNN) method with a balanced binary tree (AVL) data structure to improve prediction accuracy. The AKNN method applies pattern recognition twice in the search process and uses the preceding sequences of traffic flow to forecast the future traffic state. A clustering method and the balanced binary tree technique are introduced to build the case database and reduce search time. To illustrate the effects of these developments, the accuracy of the AKNN-AVL method, the k-nearest neighbor (KNN) method, and the auto-regressive moving average (ARMA) method is compared. These methods are calibrated and evaluated using real-time data from a freeway traffic detector near the North 3rd Ring Road in Beijing under both normal and incident traffic conditions. The comparisons show that the AKNN-AVL method with the optimal neighbor and pattern size outperforms both the KNN method and the ARMA method under both normal and incident traffic conditions. In addition, combining the clustering method and the balanced binary tree technique with the prediction method increases search speed and allows a rapid response to fluctuations in the case database.
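The pattern-matching idea behind KNN traffic forecasting (comparing the most recent sequence of flow values with earlier sequences and averaging what followed the closest matches) can be sketched as below. This is the plain KNN baseline only, not the AKNN-AVL method: the two-stage search, the clustering step, and the AVL-tree case database are omitted, and the function name and flow values are made up for illustration.

```python
import math

def knn_forecast(series, pattern_size, k):
    """Predict the next value from the k most similar historical patterns."""
    query = series[-pattern_size:]
    candidates = []
    # Each candidate is a past window together with the value that followed it.
    # The final window (the query itself) is excluded because nothing follows it.
    for start in range(len(series) - pattern_size):
        window = series[start:start + pattern_size]
        following = series[start + pattern_size]
        dist = math.sqrt(sum((w - q) ** 2 for w, q in zip(window, query)))
        candidates.append((dist, following))
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:k]
    return sum(value for _, value in nearest) / len(nearest)

# Hypothetical flow counts with a repeating four-step cycle.
flow = [10, 20, 30, 40] * 5
prediction = knn_forecast(flow, pattern_size=3, k=2)
print(prediction)  # 10.0: the value that followed every earlier (20, 30, 40)
```

This linear scan costs time proportional to the length of the history for every forecast, which is the search cost that the abstract's clustering and balanced-tree case database aim to reduce.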
Abstract: Various methods have been used to estimate the amount of above-ground forest biomass across landscapes and to create biomass maps for specific stands or pixels across ownership or project areas. Without an accurate estimation method, land managers might end up with incorrect biomass estimate maps, which could lead them to make poorer decisions in their future management plans. The goal of this study was to compare various imputation methods for predicting forest biomass and basal area at a project planning scale. A combination of ground inventory plots, light detection and ranging (LiDAR) data, satellite imagery, and climate data was analyzed, and the root mean square error (RMSE) and bias of each method were calculated. Results indicate that for biomass prediction, k-nn (k = 5) had the lowest RMSE and the least bias. The second most accurate method was k-nn (k = 3), followed by the GWR model and random forest imputation. For basal area prediction, the GWR model had the lowest RMSE and the least bias. The second most accurate method was k-nn (k = 5), followed by k-nn (k = 3) and the random forest method. For both metrics, the GNN method was the least accurate based on the ranking of RMSE and bias.
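The two accuracy measures used to rank the methods in this abstract can be written out as a short sketch. The sign convention for bias (predicted minus observed) and the tie-breaking rule are assumptions, since the abstract does not state them, and the plot values and method names are made up for illustration.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between paired observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)

def bias(observed, predicted):
    """Mean signed error; positive values mean over-prediction (assumed sign)."""
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)

def rank_methods(observed, predictions):
    """Sort method names by RMSE, breaking ties by absolute bias."""
    scores = {name: (rmse(observed, p), abs(bias(observed, p)))
              for name, p in predictions.items()}
    return sorted(scores, key=scores.get)

# Hypothetical plot-level biomass values and two hypothetical methods.
observed = [100, 150, 200, 250]
predictions = {
    "method_a": [110, 140, 210, 240],  # errors cancel: RMSE 10, bias 0
    "method_b": [130, 180, 230, 280],  # constant over-prediction: RMSE 30, bias 30
}
ranking = rank_methods(observed, predictions)
print(ranking)
```

Reporting both measures matters because a method can have zero bias while still making large errors in both directions, as method_a does on a smaller scale here.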
Abstract: The k-nearest neighbor (k-NN) method was evaluated for predicting the influent flow rate and four water quality parameters, namely chemical oxygen demand (COD), suspended solids (SS), total nitrogen (T-N), and total phosphorus (T-P), at a wastewater treatment plant (WWTP). The search range and the approach for determining the number of nearest neighbors (NNs) under dry and wet weather conditions were first optimized based on the root mean square error (RMSE). The optimum search range, in terms of data size, was one year. The square root-based (SR) approach was superior to the distance factor-based (DF) approach in determining the appropriate number of NNs. However, the results for both approaches varied slightly depending on the water quality parameter and the weather conditions. The influent flow rate was accurately predicted to within one standard deviation of the measured values. Influent water quality was well predicted, as assessed by the mean absolute percentage error (MAPE), under both wet and dry weather conditions. For the seven-day prediction, the difference in predictive accuracy was less than 5% in dry weather conditions and slightly worse in wet weather conditions. Overall, the k-NN method was verified to be useful for predicting WWTP influent characteristics.
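A minimal sketch of k-NN regression with a square-root rule for the number of neighbors, scored by MAPE, is given below. The abstract does not define its square root-based approach, so taking k as the rounded square root of the number of samples is an assumption (it is a common heuristic). The single "day index" feature, the linear toy data, and all function names are hypothetical.

```python
import math

def sqrt_rule_k(n_samples):
    """Number of neighbours as the rounded square root of the sample count."""
    return max(1, round(math.sqrt(n_samples)))

def knn_regress(train_X, train_y, query, k):
    """Average the targets of the k training points nearest to the query."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    return sum(train_y[i] for i in order[:k]) / k

def mape(observed, predicted):
    """Mean absolute percentage error, in percent."""
    n = len(observed)
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(observed, predicted)) / n

# Hypothetical history: 16 samples of one feature and an influent value.
train_X = [(float(d),) for d in range(1, 17)]
train_y = [100.0 + 2.0 * d for d in range(1, 17)]

k = sqrt_rule_k(len(train_X))                    # 4 neighbours for 16 samples
estimate = knn_regress(train_X, train_y, (8.5,), k)
print(k, estimate, mape([117.0], [estimate]))
```

With a one-year search range of daily data, the same rule would give roughly 19 neighbors, which shows how the choice of search range and the choice of k interact.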