Built on machine learning, suitable algorithms enable advanced time series analysis. This paper proposes a complex k-nearest neighbor (KNN) model for predicting financial time series. The model uses a feature extraction process that integrates forward rolling empirical mode decomposition (EMD) for financial time series signal analysis with principal component analysis (PCA) for dimension reduction. The information-rich features are extracted and then input to a weighted KNN classifier, in which the features are weighted by their PCA loadings. Finally, the prediction is generated by regression on the selected nearest neighbors. The structure of the model as a whole is original. Tests on real historical data sets confirm the effectiveness of the model for predicting the Chinese stock index, an individual stock, and the EUR/USD exchange rate.
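The final stage described above (feature-weighted neighbor search followed by regression on the selected neighbors) might be sketched as follows. This is only an illustration under stated assumptions: the `weights` argument is a hypothetical stand-in for the PCA loadings, the "regression" is reduced to a plain average of the neighbors' targets, and the EMD and PCA steps are not reproduced.

```python
import math

def weighted_knn_regress(train_x, train_y, query, weights, k=3):
    """Average the targets of the k training rows nearest to `query`.

    `weights` holds one non-negative weight per feature (assumed here to
    play the role of PCA loadings); it scales each squared difference.
    """
    def dist(a, b):
        return math.sqrt(sum(w * (ai - bi) ** 2
                             for w, ai, bi in zip(weights, a, b)))

    # Rank training rows by weighted distance to the query.
    ranked = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    nearest = ranked[:k]
    # Simplest possible local regression: the mean of the neighbors' targets.
    return sum(train_y[i] for i in nearest) / len(nearest)
```

Setting a feature's weight to zero removes it from the distance entirely, which shows how the weighting changes which neighbors are selected.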
The k-nearest neighbor method is one of the most popular techniques for both classification and regression. Because of how it operates, its application may be limited to problems with a limited number of instances, particularly when run time is a consideration. However, classifying large amounts of data has become a fundamental task in many real-world applications, so it is logical to scale the k-nearest neighbor method to large datasets. This paper proposes a new k-nearest neighbor classification method (KNN-CCL) that uses a parallel centroid-based and hierarchical clustering algorithm to separate the training dataset into multiple parts. The clustering algorithm uses four stages of successive refinement and generates high-quality clusters, which the k-nearest neighbor approach then uses to predict the test datasets. Finally, sets of experiments are conducted on the UCI datasets. The experimental results confirm that the proposed method performs well with regard to classification accuracy and performance.
Interference signal recognition plays an important role in anti-jamming communication. With the development of deep learning, many supervised interference signal recognition algorithms based on deep learning have emerged recently and outperform traditional recognition algorithms. However, there is no unsupervised interference signal recognition algorithm at present. In this paper, an unsupervised interference signal recognition method called double phases and double dimensions contrastive clustering (DDCC) is proposed. Specifically, in the first phase, four data augmentation strategies for interference signals are used in data-augmentation-based (DA-based) contrastive learning. In the second phase, the original dataset's k-nearest neighbor set (KNNset) is designed in double dimensions contrastive learning. In addition, a dynamic entropy parameter strategy is proposed. Simulation experiments on 9 types of interference signals show that random cropping is the best of the four data augmentation strategies; that feature dimensional contrastive learning in the second phase can improve clustering purity; and that the dynamic entropy parameter strategy can effectively improve the stability of DDCC. The unsupervised interference signal recognition results of DDCC and five other deep clustering algorithms show that the clustering performance of DDCC is superior to that of the other algorithms. In particular, when the jamming-to-noise ratio (JNR) is −5 dB, the clustering purity of our method is above 92%, SCAN's is 81%, and that of the other three methods is below 71%. In addition, the performance of our method is close to that of the supervised learning algorithm.
The winding is one of the most important components in power transformers, and ensuring its health is of great importance to the stable operation of the power system. To diagnose the disc space variation (DSV) fault degree of transformer winding efficiently and accurately, this paper presents a winding fault diagnostic method based on the K-Nearest Neighbor (KNN) algorithm and the frequency response analysis (FRA) method. First, a laboratory winding model is used, and DSV faults of four different degrees are produced by changing the disc spacing in the winding. A series of FRA tests is then conducted to obtain the FRA results and build the FRA dataset. Second, ten different numerical indices are used to extract features from the FRA curves of the faulted winding. Third, 10-fold cross-validation is employed to determine the optimal k-value of KNN. In addition, to improve the accuracy of the KNN model, the accuracy of the KNN algorithm is compared across k-values under four distance functions. After the most appropriate distance metric and k-value are obtained, the fault classification model based on KNN and FRA is constructed and used to classify the degrees of DSV faults. The identification accuracy of the proposed model is up to 98.30%. Finally, the performance of the model is compared with that of the support vector machine (SVM), SVM optimized by particle swarm optimization (PSO-SVM), and random forest (RF). The results show that the proposed model has the highest diagnostic accuracy and can be used to accurately diagnose the DSV fault degrees of the winding.
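The k-selection step mentioned above (10-fold cross-validation over candidate k-values) could look roughly like the sketch below. It assumes a plain majority-vote KNN with Euclidean distance and a simple strided fold split; the function names and the toy data layout are illustrative and are not taken from the paper, which also compares four distance functions.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest rows; `train` is a list of (features, label)."""
    # Squared Euclidean distance is enough for ranking neighbors.
    ranked = sorted(train, key=lambda row: sum((a - b) ** 2
                                               for a, b in zip(row[0], query)))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def cv_accuracy(data, k, folds=10):
    """Mean accuracy of KNN over `folds` strided folds of `data`."""
    correct = 0
    for f in range(folds):
        test = data[f::folds]                      # every folds-th row is held out
        train = [row for i, row in enumerate(data) if i % folds != f]
        correct += sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(data)

def best_k(data, candidates, folds=10):
    """Return the candidate k with the highest cross-validated accuracy."""
    return max(candidates, key=lambda k: cv_accuracy(data, k, folds))
```

On ties, `max` keeps the first candidate, so listing candidates in ascending order favors the smaller k.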
Machine learning algorithms (MLs) can potentially improve disease diagnostics, leading to earlier detection and treatment. As a malignant tumor whose primary focus is located in the bronchial mucosal epithelium, lung cancer has the highest mortality and morbidity among cancer types, threatening the health and life of patients. Machine learning algorithms such as Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Naïve Bayes (NB) have been used for lung cancer prediction. However, they still face challenges such as high dimensionality of the feature space, over-fitting, high computational complexity, noise and missing data, low accuracy, low precision, and high error rates. Ensemble learning, which combines classifiers, may help boost prediction on new data. However, current ensemble ML techniques rarely use comprehensive evaluation metrics to assess the performance of individual classifiers. The main purpose of this study was to develop an ensemble classifier that improves lung cancer prediction. An ensemble machine learning algorithm is developed based on RF, SVM, NB, and KNN. Feature selection is based on Principal Component Analysis (PCA) and Analysis of Variance (ANOVA). The algorithm is then executed on lung cancer data and evaluated using execution time, true positives (TP), true negatives (TN), false positives (FP), false negatives (FN), false positive rate (FPR), recall (R), precision (P), and F-measure (FM). Experimental results show that the proposed ensemble classifier has the best classification of 0.9825% with the lowest error rate of 0.0193. This is followed by SVM, for which the probability of having the best classification is 0.9652% at an error rate of 0.0206. NB had the worst performance, with 0.8475% classification at an error rate of 0.0738.
Air quality is a critical concern for public health and environmental regulation. The Air Quality Index (AQI), an index widely adopted by the US Environmental Protection Agency (EPA), serves as a crucial metric for reporting site-specific air pollution levels. Accurately predicting air quality, as measured by the AQI, is essential for effective air pollution management. In this study, we aim to identify the most reliable regression model among linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), logistic regression, and K-nearest neighbors (KNN). We conducted four different regression analyses using a machine learning approach to determine the model with the best performance. Using the confusion matrix and error percentages, we selected the best-performing model; the prediction error rates were 22%, 23%, 20%, and 27% for the LDA, QDA, logistic regression, and KNN models, respectively. The logistic regression model outperformed the other three statistical models in predicting AQI. Understanding the performance of these models can help address an existing gap in air quality research and contribute to the integration of regression techniques in AQI studies, ultimately benefiting stakeholders such as environmental regulators, healthcare professionals, urban planners, and researchers.
This paper proposes a motion information analysis system based on acceleration data, consisting of filtering, feature extraction, and classification. A Kalman filter is adopted to eliminate noise. Time-domain and frequency-domain analysis yields acceleration features such as the amplitude, the period, and the acceleration region values. Furthermore, the accuracy of the motion classification is improved by using the k-nearest neighbor (KNN) algorithm.
Objective: To detect unknown network worms at their early propagation stage. Methods: On the basis of the characteristics of network worm attacks, the concept of failed connection flow (FCT) was defined. Based on wavelet packet analysis of the FCT time series, the method computed the energy associated with each wavelet packet, transformed the FCT time series into a series of energy distribution vectors in the frequency domain, and then applied a trained K-nearest neighbor (KNN) classifier to identify the worm. Results: The experiment showed that the method could identify a network worm when the worm started to scan. Compared with the theoretical value, the identification error ratio was 5.69%. Conclusion: The method can effectively detect unknown network worms at their early propagation stage.
We developed software that performs laminae counting, thickness measurement, and spectral and wavelet analysis of the signal embedded in laminated sediments. We validated the software on varved sediments. Varved laminae are counted automatically using an image analysis classification method based on the K-Nearest Neighbors (KNN) algorithm. Next, the signal corresponding to the thickness variation of the black varved laminae is retrieved. This signal is a good proxy for studying the paleoclimatic constraints that control sedimentation. Finally, spectral and wavelet analysis of the variation in black laminae thickness revealed frequencies and periods that can be linked to known paleoclimatic events.
The EMG signal, which is generated by muscle activity, diffuses to the skin surface of the human body. This paper presents a pattern recognition system for classifying upper arm motions based on the Linear Discriminant Analysis (LDA) algorithm, which has mainly been used in face recognition and voice recognition. LDA is also compared with the k-Nearest Neighbor (k-NN) algorithm on the same classification task. The results demonstrate that LDA outperforms k-NN and classifies the motions very accurately, with very small classification errors. The paper is organized as follows: Muscle Anatomy, Data Classification Methods, Theory of Linear Discriminant Analysis, k-Nearest Neighbor (kNN) Algorithm, Modeling of EMG Pattern Recognition, EMG Data Generator, Electromyography Feature Extraction, Implemented System Results and Discussions, and finally, Conclusions. The proposed structure is simulated using MATLAB.
Text categorization is an important technique for managing the surging volume of text data on the Internet. The k-nearest neighbors (kNN) algorithm is an effective, but not efficient, classification model for text categorization. In this paper, we propose an effective strategy to accelerate the standard kNN, based on a simple principle: points that are near each other in space are usually also near each other when projected onto a direction, which means that points that are distant along the projection direction are also distant in the original space. Using the proposed strategy, most of the irrelevant points can be removed when searching for the k-nearest neighbors of a query point, which greatly decreases the computation cost. Experimental results show that the proposed strategy greatly improves the time performance of the standard kNN, with little degradation in accuracy. It is particularly advantageous in applications with large and high-dimensional datasets.
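The projection principle can be illustrated with a small sketch. For a unit direction u, the gap between two projections never exceeds the Euclidean distance between the points, so a candidate whose projection gap already exceeds the current k-th best distance can be skipped. The code below is an exact-search variant written under that assumption; it is not the paper's own strategy (which accepts a small loss in accuracy), and the function name and the single fixed `direction` are illustrative choices.

```python
import heapq
import math

def knn_with_projection_pruning(points, query, k, direction):
    """Return (indices of the k nearest points, number of full distances computed)."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]

    def proj(p):
        return sum(a * b for a, b in zip(p, unit))

    q_proj = proj(query)
    # Visit candidates in order of increasing projection gap to the query.
    order = sorted(range(len(points)), key=lambda i: abs(proj(points[i]) - q_proj))

    heap = []       # max-heap of the best k so far, stored as (-distance, index)
    computed = 0
    for i in order:
        gap = abs(proj(points[i]) - q_proj)
        # gap is a lower bound on the true distance; later candidates have larger gaps.
        if len(heap) == k and gap > -heap[0][0]:
            break
        d = math.dist(points[i], query)
        computed += 1
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, i))

    nearest = sorted((-neg_d, i) for neg_d, i in heap)
    return [i for _, i in nearest], computed
```

How much work is saved depends on how well the chosen direction spreads the data; a direction along which most points coincide prunes almost nothing.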
Missing values are prevalent in real-world datasets and may reduce the predictive performance of a learning algorithm. Dissolved Gas Analysis (DGA), one of the most widely deployed methods for detecting and predicting incipient faults in power transformers, is among the methods affected. This paper therefore proposes filling in the missing values in a DGA dataset using the k-nearest neighbor imputation method with two different distance metrics: Euclidean and Cityblock. Using the imputed datasets as inputs, the study then applies Support Vector Machine (SVM) to build models that classify transformer faults. Experimental results are provided to show the effectiveness of the proposed approach.
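A minimal sketch of k-nearest neighbor imputation with the two metrics named above is given here. It assumes missing entries are marked with `None`, that at least one complete row exists, that neighbors are drawn only from complete rows, and that the fill value is the plain mean of the neighbors' values; none of these details are stated in the abstract, and the SVM stage is not shown.

```python
def knn_impute(rows, k=2, metric="euclidean"):
    """Fill None entries with the column mean over the k nearest complete rows."""
    complete = [r for r in rows if None not in r]

    def distance(a, b):
        # Compare only the features that are present in both rows.
        diffs = [abs(x - y) for x, y in zip(a, b)
                 if x is not None and y is not None]
        if metric == "cityblock":
            return sum(diffs)                       # L1 (Manhattan) distance
        return sum(d * d for d in diffs) ** 0.5     # L2 (Euclidean) distance

    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        nearest = sorted(complete, key=lambda c: distance(row, c))[:k]
        filled.append([
            sum(n[j] for n in nearest) / len(nearest) if v is None else v
            for j, v in enumerate(row)
        ])
    return filled
```

For example, with rows `[1, 1, 10]`, `[2, 2, 20]`, `[9, 9, 90]` and an incomplete row `[1.5, 1.5, None]`, both metrics pick the first two rows as neighbors and fill in their mean.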
Traditional clustering algorithms often struggle to produce satisfactory results on datasets with uneven density. They also incur substantial computational costs on high-dimensional data because they must calculate similarity matrices. To alleviate these issues, we employ the KD-Tree to partition the dataset and compute the K-nearest neighbors (KNN) density for each point, thereby avoiding the computation of similarity matrices. Moreover, we apply the rules of voting elections, treating each data point as a voter that casts a vote for the point with the highest density among its KNN. Using the vote counts of each point, we develop a strategy for classifying noise points and potential cluster centers, allowing the algorithm to identify clusters with uneven density and complex shapes. Additionally, we define the concept of “adhesive points” between two clusters to merge adjacent clusters that have similar densities. This process helps identify the optimal number of clusters automatically. Experimental results indicate that our algorithm improves both the efficiency and the accuracy of clustering.
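The density-and-voting idea can be sketched as below. Several details are assumptions made for illustration only: density is taken as the inverse of the mean distance to the k nearest neighbors, each voter may also vote for itself, and neighbors are found by brute force rather than with a KD-Tree. The noise-point rule and the "adhesive points" merging step are not reproduced.

```python
import math
from collections import Counter

def knn_vote_counts(points, k):
    """Return (density per point, Counter of votes received per point index)."""
    n = len(points)
    neighbors, density = [], []
    for i in range(n):
        # Brute-force k nearest neighbors of point i (a KD-Tree would replace this).
        ranked = sorted((math.dist(points[i], points[j]), j)
                        for j in range(n) if j != i)[:k]
        neighbors.append([j for _, j in ranked])
        mean_d = sum(d for d, _ in ranked) / k
        density.append(1.0 / (mean_d + 1e-12))   # assumed density definition

    votes = Counter()
    for i in range(n):
        # Each point votes for the densest point among its KNN and itself.
        candidates = neighbors[i] + [i]
        votes[max(candidates, key=lambda j: density[j])] += 1
    return density, votes
```

Points that collect many votes are natural candidates for cluster centers, while points that receive none sit on the fringe of a cluster or in sparse regions.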
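Manifold data consist of arc-shaped or ring-shaped clusters, characterized by large differences in the distances between samples of the same cluster. The density peaks clustering algorithm cannot effectively identify the cluster centers of manifold clusters, and when assigning the remaining samples it is prone to consecutive misassignment of samples. To address this, this paper proposes a density peaks clustering algorithm based on shared nearest neighbors for manifold datasets (DPC-SNN). A sample similarity definition based on shared nearest neighbors is proposed so that the similarity between samples of the same manifold cluster is as high as possible; local density is defined on this similarity without ignoring the density contribution of samples far from the cluster center, which better distinguishes the cluster centers of manifold clusters from other samples; and the remaining samples are assigned according to sample similarity, which avoids consecutive misassignment. Comparative experiments against the DPC, FKNNDPC, FNDPC, DPCSA, and IDPC-FA algorithms show that DPC-SNN can effectively find the cluster centers of manifold data and complete clustering accurately, and that it also clusters real-world and face datasets well.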
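Density peaks clustering (DPC) is a density-based clustering algorithm that can intuitively determine the number of clusters, identify clusters of arbitrary shape, and automatically detect and exclude outliers. However, DPC still has some shortcomings: on the one hand, it considers only the global distribution, so it clusters poorly on datasets whose clusters differ greatly in density; on the other hand, its point assignment strategy easily leads to a "domino effect". To address this, the RKNN-DPC algorithm is proposed, based on representative points and K-nearest neighbors (KNN). First, a K-nearest neighbor density is constructed, representative points are introduced to characterize the global distribution of the samples, and a new local density is proposed. Then, using the K-nearest neighbor information of the samples, a weighted K-nearest neighbor assignment strategy is proposed to mitigate the "domino effect". Finally, comparative experiments against five clustering algorithms are conducted on synthetic and real-world datasets. The results show that the proposed RKNN-DPC identifies cluster centers more accurately and obtains better clustering results.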
The Feixianguan Formation reservoirs in northeastern Sichuan are mainly a suite of carbonate platform deposits. The reservoir types are diverse, with high heterogeneity and complex genetic mechanisms. Pores, vugs, and fractures of different genetic mechanisms and scales often develop in association, and it is difficult to classify reservoir types based merely on static data such as outcrop observations, cores, and logging data. In this study, the reservoirs in the Feixianguan Formation are grouped into five types by combining dynamic and static data: karst breccia-residual vuggy type, solution-enhanced vuggy type, fractured-vuggy type, fractured type, and matrix type (non-reservoir). Based on conventional logging data, core data, and formation microscanner image (FMI) data from the Qilibei block in the northeastern Sichuan Basin, the reservoirs are classified according to the fracture-vug matching relationship. Based on the principle of cluster analysis, K-Nearest Neighbor (KNN) classification templates are established, and the applicability of the model is verified using reservoir data from wells not involved in the modeling. By analyzing the reservoir type discrimination results together with the production of the corresponding reservoir intervals, the contributions of the various reservoir types to production are evaluated and the reliability of the reservoir type classification is verified. The results show that the solution-enhanced vuggy type is the high-quality sweet-spot reservoir in the study area, with good physical properties and high gas production, followed by the fractured-vuggy type; the fractured and karst breccia-residual vuggy types are the least promising.
Funding (KNN model for financial time series): supported by the Social Science Foundation of China under Grant No. 17BGL231.
Funding (KNN-CCL classification method): The authors received no specific funding for this work.
Funding (DDCC interference signal recognition): This research was supported by the National Natural Science Foundation of China under Grant No. U19B2016 and by the Zhejiang Provincial Key Lab of Data Storage and Transmission Technology, Hangzhou Dianzi University.
Funding (transformer winding fault diagnosis): supported in part by the Shaanxi Natural Science Foundation Project (2023-JC-QN-0438) and in part by the Fundamental Research Funds for the Central Universities (2452021050).
Funding (acceleration-based motion analysis): supported by the In-shoe Triaxial Pressure Measurement (Grant No. 07DZ12077) and the Shanghai Innovation Project.
Funding (network worm detection): This work was supported by the National "863" Program of China (No. 2003AA148010) and the National Torch Project of China (No. 2005EB011484).
Funding: Project (No. 2012BAH18B05) supported by the National Key Technology R&D Program of China
Abstract: Text categorization is a significant technique for managing the surging volume of text data on the Internet. The k-nearest neighbors (kNN) algorithm is an effective, but not efficient, classification model for text categorization. In this paper, we propose an effective strategy to accelerate the standard kNN, based on a simple principle: points that are near in space are usually also near when projected onto a direction, which means that points that are distant along the projection direction are also distant in the original space. Using the proposed strategy, most of the irrelevant points can be removed when searching for the k-nearest neighbors of a query point, which greatly decreases the computation cost. Experimental results show that the proposed strategy greatly improves the time performance of the standard kNN, with little degradation in accuracy. Specifically, it is superior in applications that have large and high-dimensional datasets.
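The projection principle above can be illustrated with a short sketch. It relies on the fact that, for a unit direction, the gap between two projections never exceeds the true Euclidean distance, so any candidate whose projection gap is larger than the current k-th best distance can be skipped. This is an exact pruning variant written for illustration only; the abstract reports a small loss in accuracy, so the paper's actual strategy likely differs. The names `build_index` and `knn_with_projection` and the choice of direction are assumptions.

```python
import bisect
import heapq
import math

def build_index(points, direction):
    """Sort point indices by their projection onto a unit-normalized direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    projected = sorted(
        (sum(a * b for a, b in zip(p, unit)), i) for i, p in enumerate(points)
    )
    return unit, projected

def knn_with_projection(points, index, query, k):
    """Return ([(distance, point_index), ...], number_of_full_distance_computations)."""
    unit, projected = index
    keys = [proj for proj, _ in projected]
    q_proj = sum(a * b for a, b in zip(query, unit))
    right = bisect.bisect_left(keys, q_proj)
    left = right - 1
    heap = []      # max-heap of the best k so far, stored as (-distance, index)
    checked = 0
    while left >= 0 or right < len(keys):
        # Visit candidates in order of increasing projection gap.
        gap_left = q_proj - keys[left] if left >= 0 else math.inf
        gap_right = keys[right] - q_proj if right < len(keys) else math.inf
        if gap_left <= gap_right:
            gap, idx = gap_left, projected[left][1]
            left -= 1
        else:
            gap, idx = gap_right, projected[right][1]
            right += 1
        # The gap is a lower bound on the true distance, so all remaining
        # candidates are at least this far away and the search can stop.
        if len(heap) == k and gap > -heap[0][0]:
            break
        dist = math.dist(points[idx], query)
        checked += 1
        if len(heap) < k:
            heapq.heappush(heap, (-dist, idx))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, idx))
    return sorted((-d, i) for d, i in heap), checked
```

How many full distance computations are saved depends on how well the chosen direction separates the data; a direction of high variance is a natural choice, though that is a design assumption rather than something stated in the abstract.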
Abstract: Missing values are prevalent in real-world datasets and may reduce the predictive performance of a learning algorithm. Dissolved Gas Analysis (DGA), one of the most widely deployed methods for detecting and predicting incipient faults in power transformers, is among the applications affected. This paper therefore proposes filling in the missing values found in a DGA dataset using the k-nearest neighbor imputation method with two different distance metrics: Euclidean and Cityblock. Using the imputed datasets as inputs, the study then applies a Support Vector Machine (SVM) to build models that classify transformer faults. Experimental results are provided to show the effectiveness of the proposed approach.
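As a rough sketch of k-nearest neighbor imputation with the two metrics named above: each missing entry is filled with the mean of that column over the k nearest rows in which the column is observed. The function name `knn_impute`, the use of `None` for missing values, and the averaging of distances over shared columns are assumptions made for this illustration; the abstract does not give these details.

```python
import math

def knn_impute(rows, k=2, metric="euclidean"):
    """Fill None entries using the k nearest rows that observe the column."""

    def distance(a, b):
        # Compare only the columns observed in both rows, averaged so that
        # rows with different numbers of shared columns stay comparable.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        if metric == "cityblock":
            return sum(abs(x - y) for x, y in shared) / len(shared)
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (distance(row, other), other[j])
                for n, other in enumerate(rows)
                if n != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Hypothetical values, not taken from any DGA dataset.
samples = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 5.0],
    [10.0, 10.0, 50.0],
]
euclidean_filled = knn_impute(samples, k=2, metric="euclidean")
cityblock_filled = knn_impute(samples, k=2, metric="cityblock")
```

The two imputed datasets can then be passed to the same downstream classifier to compare the effect of the metric.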
Funding: National Natural Science Foundation of China (Nos. 61962054 and 62372353).
Abstract: Traditional clustering algorithms often struggle to produce satisfactory results when dealing with datasets with uneven density. Additionally, they incur substantial computational costs when applied to high-dimensional data due to calculating similarity matrices. To alleviate these issues, we employ the KD-Tree to partition the dataset and compute the K-nearest neighbors (KNN) density for each point, thereby avoiding the computation of similarity matrices. Moreover, we apply the rules of voting elections, treating each data point as a voter and casting a vote for the point with the highest density among its KNN. By utilizing the vote counts of each point, we develop the strategy for classifying noise points and potential cluster centers, allowing the algorithm to identify clusters with uneven density and complex shapes. Additionally, we define the concept of “adhesive points” between two clusters to merge adjacent clusters that have similar densities. This process helps us identify the optimal number of clusters automatically. Experimental results indicate that our algorithm not only improves the efficiency of clustering but also increases its accuracy.
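The density-and-voting idea can be sketched in a few lines. Several simplifications are assumed here and are not from the paper: neighbors are found by brute force instead of a KD-Tree, density is taken as the inverse of the mean distance to the k nearest neighbors, and each point is allowed to vote for itself. The function name `knn_density_votes` is hypothetical.

```python
import math

def knn_density_votes(points, k):
    """Return (density, votes): a KNN density per point and the votes it received."""
    n = len(points)
    neighbors, density = [], []
    for i, p in enumerate(points):
        # Brute-force neighbor search; a KD-Tree would replace this step.
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        neighbors.append([j for _, j in nearest])
        mean_dist = sum(d for d, _ in nearest) / len(nearest)
        density.append(1.0 / (mean_dist + 1e-12))
    votes = [0] * n
    for i in range(n):
        # Each point votes for the densest point among its KNN and itself.
        candidates = neighbors[i] + [i]
        winner = max(candidates, key=lambda j: density[j])
        votes[winner] += 1
    return density, votes

# A small cross-shaped cluster centered at the origin plus one distant point.
cloud = [[0.0, 0.0], [0.1, 0.0], [-0.1, 0.0], [0.0, 0.1], [0.0, -0.1], [5.0, 5.0]]
density, votes = knn_density_votes(cloud, k=2)
```

Points with many votes are natural candidates for cluster centers, while points with no votes and low density are natural candidates for noise; the thresholds the paper uses for those decisions are not given in the abstract.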
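Abstract: Manifold data consist of arc-shaped or ring-shaped clusters, and are characterized by large variation in the distances between samples of the same cluster. The density peaks clustering algorithm cannot effectively identify the centers of manifold clusters, and when it assigns the remaining samples it is prone to chained misassignment. To address this, this paper proposes a density peaks clustering algorithm based on shared nearest neighbors for manifold datasets (DPC-SNN). A sample similarity based on shared nearest neighbors is defined so that the similarity between samples of the same manifold cluster is as high as possible. Local density is then defined on the basis of this similarity without ignoring the density contribution of samples far from the cluster center, which better separates the centers of manifold clusters from the other samples. The remaining samples are assigned according to sample similarity, which avoids chained misassignment. Comparative experiments against the DPC, FKNNDPC, FNDPC, DPCSA, and IDPC-FA algorithms show that DPC-SNN can effectively find the cluster centers of manifold data and cluster the data accurately, and that it also achieves good clustering results on real-world and face datasets.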
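To make the notion of shared-nearest-neighbor similarity concrete, the sketch below uses the simplest common form: the similarity of two samples is the number of neighbors their k-nearest-neighbor sets have in common. This is an assumption for illustration; the abstract does not state the exact DPC-SNN definition, which may weight shared neighbors differently. The function names are hypothetical.

```python
import math

def knn_sets(points, k):
    """For each point, the set holding itself and its k nearest neighbors."""
    sets = []
    for p in points:
        order = sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))
        sets.append(set(order[: k + 1]))  # position 0 is the point itself
    return sets

def snn_similarity(points, k):
    """Matrix whose (i, j) entry counts the neighbors shared by points i and j."""
    nn = knn_sets(points, k)
    n = len(points)
    return [[len(nn[i] & nn[j]) for j in range(n)] for i in range(n)]

# Two well-separated groups on a line.
line = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
sim = snn_similarity(line, k=2)
```

Because the count depends on neighborhood overlap rather than raw distance, two samples at opposite ends of an elongated cluster can still be linked through chains of overlapping neighborhoods, which is the property that makes this kind of similarity attractive for manifold-shaped clusters.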
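Abstract: Density peaks clustering (DPC) is a density-based clustering algorithm that can intuitively determine the number of clusters, identify clusters of arbitrary shape, and automatically detect and exclude outliers. However, DPC still has some shortcomings. On the one hand, it considers only the global distribution, so it clusters poorly on datasets whose clusters differ greatly in density. On the other hand, its point assignment strategy easily leads to a "domino effect". To address these problems, the RKNN-DPC algorithm is proposed, based on representative points and K-nearest neighbors (KNN). First, a K-nearest-neighbor density is constructed, representative points are introduced to characterize the global distribution of the samples, and a new local density is proposed. Then, using the K-nearest-neighbor information of the samples, a weighted K-nearest-neighbor assignment strategy is proposed to alleviate the "domino effect". Finally, comparative experiments against five clustering algorithms are conducted on synthetic and real-world datasets. The results show that the proposed RKNN-DPC identifies cluster centers more accurately and obtains better clustering results.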
Abstract: The Feixianguan Formation reservoirs in northeastern Sichuan are mainly a suite of carbonate platform deposits. The reservoir types are diverse, with high heterogeneity and complex genetic mechanisms. Pores, vugs and fractures of different genetic mechanisms and scales are often developed in association, and it is difficult to classify reservoir types merely on the basis of static data such as outcrop observations, cores and logging data. In this study, the reservoirs in the Feixianguan Formation are grouped into five types by combining dynamic and static data: karst breccia-residual vuggy type, solution-enhanced vuggy type, fractured-vuggy type, fractured type and matrix type (non-reservoir). Based on conventional logging data, core data and formation microscanner image (FMI) data from the Qilibei block, northeastern Sichuan Basin, the reservoirs are classified in accordance with the fracture-vug matching relationship. Based on the principle of cluster analysis, K-Nearest Neighbor (KNN) classification templates are established, and the applicability of the model is verified using reservoir data from wells not involved in the modeling. Following an analysis of the reservoir type discrimination results and the production of the corresponding reservoir intervals, the contributions of the various reservoir types to production are evaluated and the reliability of the reservoir type classification is verified. The results show that the solution-enhanced vuggy type is the high-quality sweet-spot reservoir in the study area, with good physical properties and high gas production, followed by the fractured-vuggy type; the fractured and karst breccia-residual vuggy types are the least promising.