In this paper, we introduce a novel Multi-scale and Auto-tuned Semi-supervised Deep Subspace Clustering (MAS-DSC) algorithm, aimed at addressing the challenges of deep subspace clustering in high-dimensional real-world data, particularly in the field of medical imaging. Traditional deep subspace clustering algorithms, which are mostly unsupervised, are limited in their ability to use the prior knowledge inherent in medical images. Our MAS-DSC algorithm incorporates a semi-supervised learning framework that uses a small amount of labeled data to guide the clustering process, thereby enhancing the discriminative power of the feature representations. Additionally, the multi-scale feature extraction mechanism is designed to adapt to the complexity of medical imaging data, resulting in more accurate clustering. To address the difficulty of hyperparameter selection in deep subspace clustering, this paper employs a Bayesian optimization algorithm for adaptive tuning of hyperparameters related to subspace clustering, prior knowledge constraints, and model loss weights. Extensive experiments on standard clustering datasets, including ORL, Coil20, and Coil100, validate the effectiveness of the MAS-DSC algorithm. The results show that with its multi-scale network structure and Bayesian hyperparameter optimization, MAS-DSC achieves excellent clustering results on these datasets. Furthermore, tests on a brain tumor dataset demonstrate the robustness of the algorithm and its ability to leverage prior knowledge for efficient feature extraction and enhanced clustering performance within a semi-supervised learning framework.
As more business transactions and information services are implemented via communication networks, both personal and organizational assets face a higher risk of attack. To safeguard these, a perimeter defence such as a NIDS (network-based intrusion detection system) can be effective for known intrusions. There has been a great deal of attention within the joint community of security and data science to improving machine-learning based NIDS so that it becomes more accurate against adversarial attacks, where obfuscation techniques are applied to disguise patterns of intrusive traffic. The current research focuses on non-payload connections at the TCP (transmission control protocol) stack level, which is applicable to different network applications. In contrast to the wrapper method introduced with the benchmark dataset, three new filter models are proposed to transform the feature space without knowledge of class labels. These ECT (ensemble clustering based transformation) techniques, i.e., ECT-Subspace, ECT-Noise and ECT-Combined, are developed using the concept of ensemble clustering and three different ensemble generation strategies, i.e., random feature subspace, feature noise injection and their combination. Based on an empirical study with a published dataset and four classification algorithms, the new models usually outperform the original wrapper and other filter alternatives found in the literature. This holds both for the first experiment, on basic classification of legitimate traffic and direct attacks, and for the second, which focuses on recognizing obfuscated intrusions. In addition, an analysis of algorithmic parameters, i.e., ensemble size and level of noise, is provided as a guideline for practical use.
Clustering is a crucial method for deciphering data structure and producing new information. Due to its significance in revealing fundamental connections between the human brain and events, it is essential to utilize clustering for cognitive research. Dealing with noisy data caused by inaccurate synthesis from several sources or misleading data production processes is one of the most intriguing clustering difficulties. Noisy data can lead to incorrect object recognition and inference. This research proposes a novel clustering approach, named Picture-Neutrosophic Trusted Safe Semi-Supervised Fuzzy Clustering (PNTS3FCM), to solve the clustering problem with noisy data using the neutral and refusal degrees in the definitions of the Picture Fuzzy Set (PFS) and the Neutrosophic Set (NS). Our contribution is a new optimization model with four essential components: clustering, outlier removal, safe semi-supervised fuzzy clustering, and partitioning with labeled and unlabeled data. The effectiveness and flexibility of the proposed technique are evaluated and compared with the state-of-the-art methods, standard Picture fuzzy clustering (FC-PFS) and Confidence-weighted safe semi-supervised clustering (CS3FCM), on benchmark UCI datasets. The experimental results show that our method outperforms the compared methods on at least 10 of the 15 datasets in terms of clustering quality and computational time.
Clustering analysis is one of the main concerns in data mining. A common approach to clustering is to bring together points that are close to each other and separate points that are far from each other. Therefore, measuring the distance between sample points is crucial to the effectiveness of clustering. Filtering features by label information and measuring the distance between samples using these features is a common supervised learning method for reconstructing the distance metric. However, in many application scenarios, it is very expensive to obtain a large number of labeled samples. In this paper, to solve the clustering problem in scenarios with few supervised samples and high data dimensionality, a novel semi-supervised clustering algorithm is proposed. It uses an improved prototype network that attempts to reconstruct the distance metric in the sample space from a small amount of pairwise supervised information, such as Must-Link and Cannot-Link constraints, and then clusters the data in the new metric space. The core idea is to bring similar samples closer together and push dissimilar samples further apart through embedding mapping. Extensive experiments on both real-world and synthetic datasets show the effectiveness of this algorithm. Average clustering metrics on various datasets improved by 8% compared to the comparison algorithm.
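The core idea in the abstract above (similar samples closer, dissimilar samples further apart) can be illustrated with a small pairwise-constraint loss. This is only a sketch under stated assumptions: it uses a generic contrastive-style penalty, not the paper's improved prototype network, and the names `pairwise_constraint_loss` and `margin` are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def pairwise_constraint_loss(embeddings, must_link, cannot_link, margin=1.0):
    """Contrastive-style penalty over Must-Link / Cannot-Link pairs.

    Assumption: a generic stand-in for whatever objective the paper
    actually optimizes. Must-Link pairs are penalized by their squared
    distance; Cannot-Link pairs are penalized only when they are closer
    than `margin` (a hypothetical hyperparameter).
    """
    loss = 0.0
    for i, j in must_link:
        loss += dist(embeddings[i], embeddings[j]) ** 2
    for i, j in cannot_link:
        loss += max(0.0, margin - dist(embeddings[i], embeddings[j])) ** 2
    return loss


# Toy embedding of three samples in 2-D.
points = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.5)]
print(pairwise_constraint_loss(points, must_link=[(0, 1)], cannot_link=[(0, 2)]))
```

An embedding that lowers this value pulls Must-Link pairs together and pushes Cannot-Link pairs at least `margin` apart, after which an ordinary clustering algorithm can be run in the new space.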
A Machine Learning (ML)-based Intrusion Detection and Prevention System (IDPS) requires a large amount of labeled, up-to-date training data to effectively detect intrusions and generalize well to novel attacks. However, labeling data is costly and becomes infeasible when dealing with big data, such as those generated by Internet of Things applications. For this reason, building an ML model that learns from non-labeled or partially labeled data is of critical importance. This paper proposes a Semi-supervised Multi-Layered Clustering (SMLC) model for the detection and prevention of network intrusion. SMLC can learn from partially labeled data while achieving a detection performance comparable to that of supervised ML-based IDPS. The performance of SMLC is compared with that of a well-known semi-supervised model (tri-training) and of supervised ensemble ML models, namely Random Forest, Bagging, and AdaboostM1, on two benchmark network-intrusion datasets, NSL and Kyoto 2006+. Experimental results show that SMLC is superior to tri-training, providing a comparable detection accuracy with 20% fewer labeled training instances. Furthermore, our results demonstrate that our scheme has a detection accuracy comparable to that of the supervised ensemble models.
Due to the increase in the number of smart meter devices, a power grid generates a large amount of data. Analyzing the data can help in understanding the users' electricity consumption behavior and demands, thus enabling better service to be provided to them. Power load profile clustering is the basis for mining the users' electricity consumption behavior. Considering the complexity, randomness, and uncertainty of the users' electricity consumption behavior, this paper proposes an ensemble clustering method to analyze this behavior. First, principal component analysis (PCA) is used to reduce the dimensions of the data. Subsequently, single clustering methods are applied, and their results are integrated by majority selection. As a result, the users' electricity consumption behavior is classified into different modes, and their characteristics are analyzed in detail. This paper examines the electricity power data of 19 real users in China for simulation purposes. It provides a thorough analysis along with suggestions for the users' weekly electricity consumption behavior. The results verify the effectiveness of the proposed method.
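The integration step in the abstract above (several single clusterings combined by majority) can be sketched as follows. The abstract does not give the exact consensus rule, so this example assumes one common choice: two samples are linked when a strict majority of base clusterings co-cluster them, and consensus clusters are the connected components of those links. The function name `consensus_clusters` is hypothetical.

```python
from itertools import combinations


def consensus_clusters(base_labelings):
    """Majority-vote consensus over several base clusterings.

    Assumption: samples i and j are linked when more than half of the
    base clusterings give them the same label. Links are closed
    transitively with a union-find structure.
    """
    n = len(base_labelings[0])
    h = len(base_labelings)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        votes = sum(1 for labels in base_labelings if labels[i] == labels[j])
        if 2 * votes > h:
            parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Three base clusterings of five load profiles (label names are arbitrary).
base = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
print(consensus_clusters(base))  # [0, 0, 1, 1, 1]
```

Because the vote counts co-membership rather than raw label values, the base clusterings do not need aligned label names.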
An effective ensemble should consist of a set of networks that are both accurate and diverse. We propose a novel clustering-based selective algorithm for constructing a neural network ensemble, where clustering is used to group trained networks according to similarity and the most accurate individual network from each cluster is selected to make up the ensemble. Empirical studies on regression for four typical datasets showed that this approach yields a significantly smaller ensemble that achieves better performance than traditional ones such as Bagging and Boosting. The bias-variance decomposition of the predictive error shows that the success of the proposed approach may lie in properly tuning the bias/variance trade-off to reduce the prediction error (the sum of bias² and variance).
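The selection rule in the abstract above (group similar networks, keep the most accurate one per group) can be sketched as below. The abstract does not name the clustering method, so this example assumes a simple leader clustering on validation predictions; `select_ensemble`, `radius`, and the model names are hypothetical.

```python
def mse(pred, target):
    """Mean squared difference between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def select_ensemble(predictions, targets, radius):
    """Pick one model per cluster of similar models.

    predictions: dict mapping model name -> validation predictions.
    Assumption: leader clustering, where a model joins the first cluster
    whose leader's predictions differ from its own by at most `radius`
    (measured as MSE); otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of names; element 0 is the leader
    for name, pred in predictions.items():
        for cluster in clusters:
            if mse(pred, predictions[cluster[0]]) <= radius:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    # Keep the model with the lowest validation error in each cluster.
    return [min(c, key=lambda m: mse(predictions[m], targets)) for c in clusters]


targets = [1.0, 2.0, 3.0]
predictions = {
    "a": [1.1, 2.1, 3.1],
    "b": [1.0, 2.0, 3.2],
    "c": [2.0, 3.0, 4.0],
    "d": [1.9, 2.9, 3.5],
}
print(select_ensemble(predictions, targets, radius=0.1))  # ['a', 'd']
```

The selected models would then be combined (for regression, typically by averaging), giving an ensemble with one member per behaviorally distinct group.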
In the wake of global water scarcity, the forecasting of water quantity and quality and the regionalization of river basins have attracted serious attention from hydrology researchers. Regionalization has become an important area of research for enhancing the quality of yield prediction in river basins. In this paper, we analyze data from the Godavari basin and regionalize it using a cluster ensemble method. Cluster ensemble methods are commonly used to enhance the quality of clustering by combining multiple clustering schemes to produce a more robust scheme delivering similar homogeneous basins. The goal is to identify, analyse and describe hydrologically similar catchments using cluster analysis. Clustering is done using the RCDA cluster ensemble algorithm, which is based on discriminant analysis. The algorithm takes H base clustering schemes, each with K clusters, obtained by any clustering method, as input and constructs a discriminant function for each of them. Subsequently, cluster membership is predicted for all data tuples using the H discriminant functions. Tuples with consistent predictions are assigned to the clusters, while tuples with inconsistent predictions are analyzed further and either assigned to clusters or declared as noise. Clustering results of the RCDA algorithm have been compared with Best of k-means and the Clue cluster ensemble of the R software using traditional clustering quality measures. Further, a comparison based on domain knowledge has also been performed. All the results are encouraging and indicate better regionalization of the Godavari basin data.
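The consistency check described in the abstract above can be sketched in a few lines. This is not the RCDA implementation: it assumes the H base clusterings already use aligned label names, and it substitutes a nearest-centroid rule for the discriminant functions. All function names are hypothetical.

```python
def centroids(points, labels):
    """Mean vector of each cluster in one base clustering."""
    sums, counts = {}, {}
    for p, k in zip(points, labels):
        acc = sums.setdefault(k, [0.0] * len(p))
        for d, v in enumerate(p):
            acc[d] += v
        counts[k] = counts.get(k, 0) + 1
    return {k: [v / counts[k] for v in acc] for k, acc in sums.items()}


def predict(point, cents):
    """Nearest-centroid prediction (stand-in for a discriminant function)."""
    return min(cents, key=lambda k: sum((a - b) ** 2 for a, b in zip(point, cents[k])))


def consistent_assignment(points, base_labelings):
    """Assign a tuple only when all H models agree; otherwise return None.

    Tuples marked None correspond to the 'inconsistent' tuples that the
    abstract says are analyzed further or declared as noise.
    """
    models = [centroids(points, labels) for labels in base_labelings]
    result = []
    for p in points:
        votes = {predict(p, m) for m in models}
        result.append(votes.pop() if len(votes) == 1 else None)
    return result


points = [(0.0,), (1.0,), (5.0,), (9.0,), (10.0,)]
base = [[0, 0, 0, 1, 1], [0, 0, 1, 1, 1]]
print(consistent_assignment(points, base))  # [0, 0, None, 1, 1]
```

The middle tuple falls between the two schemes' boundaries, so the two models disagree and it is held back for further analysis.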
A novel Support Vector Machine (SVM) ensemble approach using clustering analysis is proposed. First, the positive and negative training examples are each clustered using the subtractive clustering algorithm. Then some representative examples are chosen from each of them to construct the SVM components. Finally, the outputs of the individual classifiers are fused by majority voting to obtain the final decision. The performance of the proposed method is compared with that of other popular ensemble approaches, such as Bagging, Adaboost and k-fold cross validation, on synthetic and UCI datasets. The experimental results show that our method has higher classification accuracy because the example distribution information is taken into account during ensemble construction through clustering analysis. They further indicate that our method needs much smaller training subsets than Bagging and Adaboost to obtain satisfactory classification accuracy.
A recommender system is a tool that suggests items to users based on the extensive history of user feedback. Although it is an emerging research area in both academia and industry, it suffers from sparsity, scalability, and cold-start problems. This paper addresses the sparsity and scalability problems of a model-based collaborative recommender system using an ensemble learning approach and an enhanced clustering algorithm for movie recommendations. An effective movie recommendation system is proposed based on the Classification and Regression Tree (CART) algorithm, an enhanced Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm, and a truncation method. A new hyperparameter tuning step is added to the BIRCH algorithm to enhance the cluster formation process, and the resulting algorithm is named enhanced BIRCH. The proposed model yields quality movie recommendations for new users using gradient boosting classification with broad coverage. The proposed model is tested on the Movielens dataset, and the performance is evaluated by means of Mean Absolute Error (MAE), precision, recall and f-measure. The experimental results show the superiority of the proposed model in movie recommendation compared to existing models. The proposed model obtained MAE values of 0.52 and 0.57 on the Movielens 100k and 1M datasets, respectively. Further, the proposed model obtained a precision of 0.83, a recall of 0.86 and an f-measure of 0.86 on the Movielens 100k dataset, which compare favorably with existing models in movie recommendation.
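For readers unfamiliar with the evaluation measures named in the abstract above, the following sketch computes them from their standard definitions. The ratings and item identifiers are made-up toy values, not data from the paper, and the function names are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """MAE: average absolute gap between true and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def precision_recall_f1(recommended, relevant):
    """Set-based precision, recall and F-measure for one user's list."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Toy example: four ratings and one recommendation list.
print(mean_absolute_error([4, 3, 5, 2], [3.5, 3, 4, 3]))  # 0.625
print(precision_recall_f1(["m1", "m2", "m3", "m4"], ["m2", "m3", "m5"]))
```

Lower MAE is better, while higher precision, recall and F-measure are better, which is how the figures reported in the abstract should be read.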
Target maneuver recognition is a prerequisite for air combat situation awareness, trajectory prediction, threat assessment and maneuver decision. To remove the dependence of current target maneuver recognition methods on empirical criteria and sample data, and to extract the target maneuver pattern automatically and adaptively, this paper proposes an air combat maneuver pattern extraction method based on time series segmentation and clustering analysis, combining an autoencoder, the G-G clustering algorithm and a selective ensemble clustering analysis algorithm. First, the autoencoder is used to extract key features of the maneuvering trajectory, removing the impact of redundant variables and reducing the data dimension. Then, taking the time information into account, the maneuver characteristic time series is segmented with the improved FSTS-AEGG algorithm, and a large number of maneuver primitives are extracted. Finally, the maneuver primitives are grouped into categories using the selective ensemble multiple time series clustering algorithm, which shows that each class represents a maneuver action. The maneuver pattern extraction method is applied to small-scale air combat trajectories and can recognize and correctly partition at least 71.3% of maneuver actions, indicating that the method is effective and satisfies the requirements for engineering accuracy. In addition, this method can provide data support for various target maneuver recognition methods proposed in the literature, greatly reducing the workload and improving the recognition accuracy.
Multi-label learning deals with objects associated with multiple class labels, and aims to induce a predictive model which can assign a set of relevant class labels to an unseen instance. Since each class might possess its own characteristics, the strategy of extracting label-specific features has been widely employed to improve the discrimination process in multi-label learning, where the predictive model is induced based on tailored features specific to each class label instead of identical instance representations. As a representative approach, LIFT generates label-specific features by conducting clustering analysis. However, its performance may be degraded by the inherent instability of a single clustering algorithm. To improve this, a novel multi-label learning approach named SENCE (stable label-Specific features gENeration for multi-label learning via mixture-based Clustering Ensemble) is proposed, which stabilizes the generation process of label-specific features via clustering ensemble techniques. Specifically, more stable clustering results are obtained by first augmenting the original instance representation with cluster assignments from base clusters and then fitting a mixture model via the expectation-maximization (EM) algorithm. Extensive experiments on eighteen benchmark data sets show that SENCE performs better than LIFT and other well-established multi-label learning algorithms.
Semi-supervised clustering improves learning performance by using a small number of labeled samples to assist learning from unlabeled samples. This paper implements and compares unsupervised and semi-supervised clustering analysis of BOA-Argo ocean text data. Unsupervised K-Means and Affinity Propagation (AP) are two classical clustering algorithms. The Election-AP algorithm is proposed to handle the final cluster number in AP clustering, which has proved difficult to keep within a suitable range. For the semi-supervised analysis, thermocline data in the BOA-Argo dataset are sampled according to the standard thermocline definition and used as supervision. Several semi-supervised clustering algorithms were chosen for comparison of learning performance: Constrained-K-Means, Seeded-K-Means, SAP (Semi-supervised Affinity Propagation), LSAP (Loose Seed AP) and CSAP (Compact Seed AP). To accommodate the single label, this paper modifies the above algorithms into SCKM (improved Constrained-K-Means), SSKM (improved Seeded-K-Means), and SSAP (improved Semi-supervised Affinity Propagation) to perform semi-supervised clustering analysis on the data. A DSAP (Double Seed AP) semi-supervised clustering algorithm based on compact seeds is also proposed, and the experimental data show that DSAP has a better clustering effect. The unsupervised and semi-supervised clustering results are used to analyze the potential patterns of marine data.
To improve the performance and robustness of clustering, clustering ensemble techniques generate and aggregate a number of primary clusters. Fuzzy clustering ensemble approaches attempt to improve the performance of fuzzy clustering tasks. However, in these approaches, cluster (or clustering) reliability has not received much attention. Ignoring cluster (or clustering) reliability makes these approaches weak in dealing with low-quality base clustering methods. In this paper, we use cluster unreliability estimation and a local weighting strategy to propose a new fuzzy clustering ensemble method, which introduces Reliability Based weighted co-association matrix Fuzzy C-Means (RBFCM), Reliability Based Graph Partitioning (RBGP) and Reliability Based Hyper Clustering (RBHC) as three new fuzzy clustering consensus functions. Our fuzzy clustering ensemble approach is based on fuzzy cluster unreliability estimation. Cluster unreliability is estimated according to an entropic criterion using the cluster labels in the entire ensemble. To do so, a new metric is defined to estimate the fuzzy cluster unreliability; then, the reliability value of any cluster is determined using a Reliability Driven Cluster Indicator (RDCI). The time complexities of RBHC and RBGP are linear in the number of data objects. The performance and robustness of the proposed method are experimentally evaluated on some benchmark datasets. The experimental results demonstrate the efficiency and suitability of the proposed method.
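The entropic criterion mentioned in the abstract above can be illustrated for the simpler crisp (non-fuzzy) case. This is an assumption-laden sketch, not the paper's fuzzy metric or its RDCI: it measures how much a cluster's members are scattered across the clusters of every base clustering, using Shannon entropy. The function name `cluster_uncertainty` is hypothetical.

```python
from collections import Counter
from math import log2


def cluster_uncertainty(members, base_labelings):
    """Entropy of one cluster with respect to the whole ensemble.

    members: indices of the objects in the cluster being scored.
    For each base clustering, compute the Shannon entropy of the labels
    that the members receive, then sum over all base clusterings.
    A value of 0 means every base clustering keeps the cluster intact;
    larger values mean the ensemble disagrees about it.
    """
    total = 0.0
    for labels in base_labelings:
        counts = Counter(labels[i] for i in members)
        for c in counts.values():
            p = c / len(members)
            total -= p * log2(p)
    return total


ensemble = [
    [0, 0, 0, 0, 1],  # keeps objects 0..3 together
    [0, 0, 1, 1, 1],  # splits them two and two
]
print(cluster_uncertainty([0, 1, 2, 3], ensemble))  # 1.0
```

A reliability weight can then be derived as any decreasing function of this value, so that clusters the ensemble disagrees about contribute less to the consensus.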
With the rapid development of WLAN (Wireless Local Area Network) technology, an important target of indoor positioning systems is to improve the positioning accuracy while reducing the online computation. In this paper, we propose a novel fingerprint positioning algorithm known as semi-supervised affinity propagation clustering based on distance function constraints. We show that by employing affinity propagation techniques, it is possible to use a fraction of labeled data to adjust the similarity matrix of the signal space and cluster reference points with high accuracy. The semi-supervised APC uses a combination of machine learning, clustering analysis and a fingerprinting algorithm. By collecting data and testing our algorithm in a realistic indoor WLAN environment, we obtain experimental results indicating that the proposed algorithm can improve positioning accuracy while reducing the online localization computation, compared with the widely used K nearest neighbor and maximum likelihood estimation algorithms.
Clustering categorical data, an integral part of data mining, has attracted much attention recently. In this paper, the authors formally define the categorical data clustering problem as an optimization problem from the viewpoint of cluster ensembles, and apply a cluster ensemble approach to clustering categorical data. Experimental results on real datasets show that better clustering accuracy can be obtained compared with existing categorical data clustering algorithms.
This study presents a model of computer-aided intelligence capable of automatically detecting positive COVID-19 instances for use in regular medical applications. The proposed model is based on an Ensemble boosting Neural Network architecture and can automatically detect discriminatory features on chest X-ray images through the Two Step-AS clustering algorithm with rich filter families, abstraction and weight-sharing properties. In contrast to the generally used transformational learning approach, the proposed model was trained before and after clustering. The compilation procedure divides the dataset's samples and categories into numerous sub-samples and subcategories and then assigns new group labels to each new group, with each subject group displayed as a distinct category. The retrieved discriminant characteristics of the cases were fed to the Multiple Neural Network method, which was then used to classify the instances. The Two Step-AS clustering method has been modified by pre-aggregating the dataset before applying the Multiple Neural Network algorithm to detect COVID-19 cases from chest X-ray findings. Models for the Multiple Neural Network and Two Step-AS clustering algorithms were optimised using the Ensemble Bootstrap Aggregating algorithm to reduce the number of hyperparameters they include. The tests were carried out using the COVID-19 public radiology database, and a cross-validation method ensured accuracy. The proposed classifier, with an accuracy of 98.02%, was found to provide the most efficient outcomes. The result is a low-cost, quick and reliable intelligence tool for detecting COVID-19 infection.
A clustering algorithm for semi-supervised affinity propagation based on layered combination is proposed in this paper in light of existing flaws. To improve the accuracy of the algorithm, it introduces the idea of layered combination, divides an affinity propagation clustering (APC) process evenly into several hierarchies, draws samples from the data of each hierarchy according to weight, and executes semi-supervised learning through the construction of pairwise constraints and the use of submanifold label mapping, weighting and combining the clustering results of all hierarchies by combined promotion. Theoretical analysis and experimental results show that the clustering accuracy and computational complexity of the semi-supervised affinity propagation clustering algorithm based on layered combination (SAP-LC algorithm) are greatly improved.
The magnitude and frequency of precipitation are of great significance in the field of hydrologic and hydraulic design and have wide applications in varied areas. However, the availability of precipitation data is limited to a few areas where rain gauges are successfully and efficiently installed. The magnitude and frequency of precipitation at ungauged sites can be assessed by grouping areas with similar characteristics. The procedure of grouping areas with similar behaviour is termed regionalization. In this paper, the RCDA cluster ensemble algorithm is employed to identify the homogeneous regions of rainfall in India. Cluster ensemble methods are commonly used to enhance the quality of clustering by combining multiple clustering schemes to produce a more robust scheme delivering similar homogeneous regions. The goal is to identify, analyse and describe hydrologically similar regions using the RCDA cluster ensemble algorithm, which is based on discriminant analysis. The algorithm takes H base clustering schemes, each with K clusters, obtained by any clustering method, as input and constructs a discriminant function for each of them. Subsequently, cluster membership is predicted for all data tuples using the H discriminant functions. Tuples with consistent predictions are assigned to the clusters, while tuples with inconsistent predictions are analyzed further and either assigned to clusters or declared as noise. The RCDA algorithm has been compared with Best of K-means and the Clue cluster ensemble of the R software using traditional clustering quality measures. Further, a comparison based on domain knowledge has also been performed. All the results are encouraging and indicate better regionalization of the rainfall in different parts of India.
For the growing number of large-scale data sets, building the similarity matrix required by the affinity propagation (AP) clustering algorithm incurs heavy storage and computation costs. Therefore, this paper proposes an improved affinity propagation clustering algorithm. First, subtractive clustering is added, using the density values of the data points to obtain the initial cluster points. Then, the similarity distances between the initial cluster points are calculated and, borrowing the idea of semi-supervised clustering, pairwise constraint information is added to construct a sparse similarity matrix. Finally, AP clustering is run on the cluster representative points until a suitable cluster division is obtained. Experimental results show that the algorithm greatly reduces the amount of computation and the storage required for the similarity matrix, and that it outperforms the original algorithm in both clustering quality and processing speed.
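The density values used in the first step of the abstract above can be sketched with the usual subtractive clustering potential, where each point's density is a sum of Gaussian-like contributions from all points. The abstract does not state its exact formula, so the classic form with neighbourhood radius `radius` is assumed here, and the function name `density` is hypothetical.

```python
from math import exp


def density(points, radius):
    """Subtractive-clustering density (potential) of every point.

    Assumption: the classic form D_i = sum_j exp(-4 * ||x_i - x_j||^2 / radius^2),
    where `radius` is the neighbourhood radius. Points with many close
    neighbours get a high value and are candidates for initial cluster points.
    """
    alpha = 4.0 / radius ** 2
    return [
        sum(exp(-alpha * sum((a - b) ** 2 for a, b in zip(p, q))) for q in points)
        for p in points
    ]


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
values = density(points, radius=1.0)
best = max(range(len(points)), key=values.__getitem__)
print(values, "-> first candidate:", points[best])
```

In the full procedure the highest-density point is taken as a cluster point, the densities near it are reduced, and the selection repeats; only the resulting representatives are then passed to AP clustering, which is what shrinks the similarity matrix.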
Funding (MAS-DSC study): Supported in part by the National Natural Science Foundation of China under Grant 62171203, in part by the Jiangsu Province “333 Project” High-Level Talent Cultivation Subsidized Project, in part by the Suzhou Key Supporting Subjects for Health Informatics under Grant SZFCXK202147, in part by the Changshu Science and Technology Program under Grants CS202015 and CS202246, and in part by the Changshu Key Laboratory of Medical Artificial Intelligence and Big Data under Grants CYZ202301 and CS202314.
文摘In this paper,we introduce a novel Multi-scale and Auto-tuned Semi-supervised Deep Subspace Clustering(MAS-DSC)algorithm,aimed at addressing the challenges of deep subspace clustering in high-dimensional real-world data,particularly in the field of medical imaging.Traditional deep subspace clustering algorithms,which are mostly unsupervised,are limited in their ability to effectively utilize the inherent prior knowledge in medical images.Our MAS-DSC algorithm incorporates a semi-supervised learning framework that uses a small amount of labeled data to guide the clustering process,thereby enhancing the discriminative power of the feature representations.Additionally,the multi-scale feature extraction mechanism is designed to adapt to the complexity of medical imaging data,resulting in more accurate clustering performance.To address the difficulty of hyperparameter selection in deep subspace clustering,this paper employs a Bayesian optimization algorithm for adaptive tuning of hyperparameters related to subspace clustering,prior knowledge constraints,and model loss weights.Extensive experiments on standard clustering datasets,including ORL,Coil20,and Coil100,validate the effectiveness of the MAS-DSC algorithm.The results show that with its multi-scale network structure and Bayesian hyperparameter optimization,MAS-DSC achieves excellent clustering results on these datasets.Furthermore,tests on a brain tumor dataset demonstrate the robustness of the algorithm and its ability to leverage prior knowledge for efficient feature extraction and enhanced clustering performance within a semi-supervised learning framework.
Abstract: As more business transactions and information services have been implemented via communication networks, both personal and organizational assets encounter a higher risk of attacks. To safeguard these, a perimeter defence like a NIDS (network-based intrusion detection system) can be effective for known intrusions. There has been a great deal of attention within the joint community of security and data science to improve machine-learning based NIDS such that it becomes more accurate for adversarial attacks, where obfuscation techniques are applied to disguise patterns of intrusive traffic. The current research focuses on non-payload connections at the TCP (transmission control protocol) stack level, which is applicable to different network applications. In contrast to the wrapper method introduced with the benchmark dataset, three new filter models are proposed to transform the feature space without knowledge of class labels. These ECT (ensemble clustering based transformation) techniques, i.e., ECT-Subspace, ECT-Noise and ECT-Combined, are developed using the concept of ensemble clustering and three different ensemble generation strategies, i.e., random feature subspace, feature noise injection and their combination. Based on the empirical study with a published dataset and four classification algorithms, the new models usually outperform the original wrapper and other filter alternatives found in the literature. The same conclusion is drawn from both the first experiment, on basic classification of legitimate and direct attacks, and the second, which focuses on recognizing obfuscated intrusions. In addition, an analysis of algorithmic parameters, i.e., ensemble size and level of noise, is provided as a guideline for practical use.
Funding: This research is funded by Graduate University of Science and Technology under grant number GUST.STS.DT2020-TT01.
Abstract: Clustering is a crucial method for deciphering data structure and producing new information. Due to its significance in revealing fundamental connections between the human brain and events, it is essential to utilize clustering for cognitive research. Dealing with noisy data caused by inaccurate synthesis from several sources or misleading data production processes is one of the most intriguing clustering difficulties. Noisy data can lead to incorrect object recognition and inference. This research aims to develop a novel clustering approach, named Picture-Neutrosophic Trusted Safe Semi-Supervised Fuzzy Clustering (PNTS3FCM), to solve the clustering problem with noisy data using the neutral and refusal degrees in the definitions of Picture Fuzzy Set (PFS) and Neutrosophic Set (NS). Our contribution is to propose a new optimization model with four essential components: clustering, outlier removal, safe semi-supervised fuzzy clustering, and partitioning with labeled and unlabeled data. The effectiveness and flexibility of the proposed technique are evaluated and compared with state-of-the-art methods, standard Picture fuzzy clustering (FC-PFS) and Confidence-weighted safe semi-supervised clustering (CS3FCM), on benchmark UCI datasets. The experimental results show that our method is better than the compared methods on at least 10 of 15 datasets in terms of clustering quality and computational time.
Abstract: Clustering analysis is one of the main concerns in data mining. A common approach to the clustering process is to bring together points that are close to each other and separate points that are far from each other. Therefore, measuring the distance between sample points is crucial to the effectiveness of clustering. Filtering features by label information and measuring the distance between samples by these features is a common supervised learning method to reconstruct the distance metric. However, in many application scenarios, it is very expensive to obtain a large number of labeled samples. In this paper, to solve the clustering problem in scenarios with few supervised samples and high data dimensionality, a novel semi-supervised clustering algorithm is proposed by designing an improved prototype network that attempts to reconstruct the distance metric in the sample space with a small amount of pairwise supervised information, such as Must-Link and Cannot-Link, and then clusters the data in the new metric space. The core idea is to make similar samples closer and dissimilar samples further apart through embedding mapping. Extensive experiments on both real-world and synthetic datasets show the effectiveness of this algorithm. Average clustering metrics on various datasets improved by 8% compared to the comparison algorithm.
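The pairwise supervision mentioned in this abstract can be illustrated with a minimal sketch. The helper name `count_violations` and the sample data are hypothetical and are not taken from the paper; the code does not implement the prototype network, it only shows how Must-Link and Cannot-Link pairs are checked against a candidate cluster assignment.

```python
from typing import Iterable, Sequence, Tuple

Pair = Tuple[int, int]


def count_violations(
    labels: Sequence[int],
    must_link: Iterable[Pair],
    cannot_link: Iterable[Pair],
) -> int:
    """Count the pairwise constraints broken by a cluster assignment.

    Hypothetical helper for illustration only.
    """
    violations = 0
    for i, j in must_link:
        # A Must-Link pair is violated when the samples land in different clusters.
        if labels[i] != labels[j]:
            violations += 1
    for i, j in cannot_link:
        # A Cannot-Link pair is violated when the samples share a cluster.
        if labels[i] == labels[j]:
            violations += 1
    return violations


labels = [0, 0, 1, 1]
print(count_violations(labels, must_link=[(0, 1), (1, 2)], cannot_link=[(0, 3), (2, 3)]))  # prints 2
```

A metric-learning method of the kind described would aim to produce an embedding in which a subsequent clustering drives this count toward zero.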
Abstract: A Machine Learning (ML)-based Intrusion Detection and Prevention System (IDPS) requires a large amount of labeled up-to-date training data to effectively detect intrusions and generalize well to novel attacks. However, the labeling of data is costly and becomes infeasible when dealing with big data, such as those generated by Internet of Things applications. To this effect, building an ML model that learns from non-labeled or partially labeled data is of critical importance. This paper proposes a Semi-supervised Multi-Layered Clustering (SMLC) model for the detection and prevention of network intrusion. SMLC has the capability to learn from partially labeled data while achieving a detection performance comparable to that of supervised ML-based IDPS. The performance of SMLC is compared with that of a well-known semi-supervised model (tri-training) and of supervised ensemble ML models, namely Random Forest, Bagging, and AdaboostM1, on two benchmark network-intrusion datasets, NSL and Kyoto 2006+. Experimental results show that SMLC is superior to tri-training, providing a comparable detection accuracy with 20% fewer labeled instances of training data. Furthermore, our results demonstrate that our scheme has a detection accuracy comparable to that of the supervised ensemble models.
Funding: Supported by the State Grid Science and Technology Project (No. 5442AI90009) and the Natural Science Foundation of China (No. 6170337).
Abstract: Due to the increase in the number of smart meter devices, a power grid generates a large amount of data. Analyzing the data can help in understanding the users' electricity consumption behavior and demands, thus enabling better service to be provided to them. Performing power load profile clustering is the basis for mining the users' electricity consumption behavior. By examining the complexity, randomness, and uncertainty of the users' electricity consumption behavior, this paper proposes an ensemble clustering method to analyze this behavior. First, principal component analysis (PCA) is used to reduce the dimensions of the data. Subsequently, single clustering methods are applied, and their majority result is selected for the integrated clustering. As a result, the users' electricity consumption behavior is classified into different modes, and their characteristics are analyzed in detail. This paper examines the electricity power data of 19 real users in China for simulation purposes. This manuscript provides a thorough analysis along with suggestions for the users' weekly electricity consumption behavior. The results verify the effectiveness of the proposed method.
Abstract: An effective ensemble should consist of a set of networks that are both accurate and diverse. We propose a novel clustering-based selective algorithm for constructing neural network ensembles, where clustering is used to group trained networks according to similarity and the most accurate individual network is selected from each cluster to make up the ensemble. Empirical studies on regression of four typical datasets showed that this approach yields a significantly smaller ensemble that achieves better performance than traditional ones such as Bagging and Boosting. The bias-variance decomposition of the predictive error shows that the success of the proposed approach may lie in its proper tuning of the bias/variance trade-off to reduce the prediction error (the sum of bias² and variance).
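The selection step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the networks are taken to be already trained and already grouped into clusters, and the function names `select_per_cluster` and `ensemble_predict`, the validation errors, and the simple averaging of outputs are all hypothetical choices, not details given in the abstract.

```python
from typing import Dict, List, Sequence


def select_per_cluster(cluster_ids: Sequence[int], val_errors: Sequence[float]) -> List[int]:
    """Return the index of the lowest-error model in each cluster."""
    best: Dict[int, int] = {}
    for idx, (cid, err) in enumerate(zip(cluster_ids, val_errors)):
        # Keep the model with the smallest validation error seen so far per cluster.
        if cid not in best or err < val_errors[best[cid]]:
            best[cid] = idx
    return sorted(best.values())


def ensemble_predict(predictions: Sequence[Sequence[float]], selected: Sequence[int]) -> List[float]:
    """Average the regression outputs of the selected models (assumed fusion rule)."""
    n_points = len(predictions[0])
    return [sum(predictions[m][p] for m in selected) / len(selected) for p in range(n_points)]


# Five hypothetical networks grouped into three clusters.
cluster_ids = [0, 0, 1, 1, 2]
val_errors = [0.30, 0.20, 0.50, 0.40, 0.10]
selected = select_per_cluster(cluster_ids, val_errors)
print(selected)  # prints [1, 3, 4]
```

Picking one member per cluster keeps the ensemble small while retaining one representative of each group of similar networks, which is the diversity argument the abstract makes.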
Abstract: In the wake of global water scarcity and the need to forecast water quantity and quality, the regionalization of river basins has attracted serious attention from hydrology researchers. It has become an important area of research for enhancing the quality of yield prediction in river basins. In this paper, we analyze the data of the Godavari basin and regionalize it using a cluster ensemble method. Cluster ensemble methods are commonly used to enhance the quality of clustering by combining multiple clustering schemes to produce a more robust scheme delivering similar homogeneous basins. The goal is to identify, analyse and describe hydrologically similar catchments using cluster analysis. Clustering has been done using the RCDA cluster ensemble algorithm, which is based on discriminant analysis. The algorithm takes H base clustering schemes, each with K clusters and obtained by any clustering method, as input and constructs a discriminant function for each of them. Subsequently, cluster membership is predicted for all the data tuples using the H discriminant functions. Tuples with consistent predictions are assigned to the clusters, while tuples with inconsistent predictions are analyzed further and either assigned to clusters or declared as noise. Clustering results of the RCDA algorithm have been compared with the Best of k-means and the Clue cluster ensemble of the R software using traditional clustering quality measures. Further, a domain knowledge based comparison has also been performed. All the results are encouraging and indicate better regionalization of the Godavari basin data.
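The consistency check at the heart of the procedure described above can be sketched in a few lines. This is not the RCDA implementation: the function name `split_by_consistency` is hypothetical, the predictions of the H discriminant functions are assumed to be given, and cluster labels are assumed to be aligned across the H schemes. The abstract does not say how inconsistent tuples are analyzed further, so the sketch only separates them.

```python
from typing import Dict, List, Sequence, Tuple


def split_by_consistency(predictions: Sequence[Sequence[int]]) -> Tuple[Dict[int, int], List[int]]:
    """Split tuples into consistently and inconsistently predicted ones.

    predictions[h][i] is the cluster predicted for tuple i by the h-th
    discriminant function. Labels are assumed to be aligned across schemes.
    """
    n_tuples = len(predictions[0])
    assigned: Dict[int, int] = {}
    undecided: List[int] = []
    for i in range(n_tuples):
        votes = {scheme[i] for scheme in predictions}
        if len(votes) == 1:
            # All H discriminant functions agree: assign the tuple to that cluster.
            assigned[i] = votes.pop()
        else:
            # Disagreement: leave the tuple for further analysis (or noise).
            undecided.append(i)
    return assigned, undecided


# H = 3 hypothetical schemes predicting clusters for 4 tuples.
preds = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 2]]
assigned, undecided = split_by_consistency(preds)
print(assigned, undecided)  # prints {0: 0, 2: 1} [1, 3]
```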
Funding: Supported by the National Natural Science Foundation of China (No. 60472072) and the Specialized Research Foundation for the Doctoral Program of Higher Education of China (No. 20040699034).
Abstract: A novel Support Vector Machine (SVM) ensemble approach using clustering analysis is proposed. Firstly, the positive and negative training examples are clustered separately through a subtractive clustering algorithm. Then some representative examples are chosen from each of them to construct the SVM components. At last, the outputs of the individual classifiers are fused through the majority voting method to obtain the final decision. Comparisons of performance between the proposed method and other popular ensemble approaches, such as Bagging, Adaboost and k-fold cross validation, are carried out on synthetic and UCI datasets. The experimental results show that our method has higher classification accuracy since the example distribution information is considered during ensemble construction through clustering analysis. They further indicate that our method needs much smaller training subsets than Bagging and Adaboost to obtain satisfactory classification accuracy.
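The final fusion step, majority voting over the component classifiers, can be sketched as below. The function name `majority_vote` and the example outputs are hypothetical; the component SVMs themselves are assumed to be trained elsewhere, and an odd number of binary classifiers is assumed so that no ties occur.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(outputs: Sequence[Sequence[int]]) -> List[int]:
    """Fuse per-classifier label predictions by majority vote.

    outputs[k][i] is the label (+1 or -1) that classifier k assigns to sample i.
    """
    n_samples = len(outputs[0])
    fused = []
    for i in range(n_samples):
        # Tally the votes of all classifiers for sample i and keep the top label.
        votes = Counter(clf[i] for clf in outputs)
        fused.append(votes.most_common(1)[0][0])
    return fused


# Three hypothetical component classifiers, three samples.
outputs = [[1, -1, 1], [1, 1, -1], [-1, -1, -1]]
print(majority_vote(outputs))  # prints [1, -1, -1]
```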
Abstract: A recommender system is a tool that suggests items to users based on the extensive history of their feedback. Although it is an emerging research area of interest to both academia and industry, it suffers from sparsity, scalability, and cold-start problems. This paper addresses the sparsity and scalability problems of a model-based collaborative recommender system using an ensemble learning approach and an enhanced clustering algorithm for movie recommendations. An effective movie recommendation system is proposed using the Classification and Regression Tree (CART) algorithm, an enhanced Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm, and a truncation method. A new hyperparameter tuning step is added to the BIRCH algorithm to enhance the cluster formation process, and the resulting algorithm is named enhanced BIRCH. The proposed model yields quality movie recommendations for new users using Gradient boost classification with broad coverage. The proposed model is tested on the Movielens dataset, and the performance is evaluated by means of Mean Absolute Error (MAE), precision, recall and f-measure. The experimental results showed the superiority of the proposed model in movie recommendation compared to the existing models. The proposed model obtained MAE values of 0.52 and 0.57 on the Movielens 100k and 1M datasets. Further, the proposed model obtained a precision of 0.83, a recall of 0.86 and an f-measure of 0.86 on the Movielens 100k dataset, which compare favorably with the existing models in movie recommendation.
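For readers unfamiliar with the evaluation measures named in this abstract, the following sketch shows their standard definitions. The function names and the toy ratings are hypothetical and unrelated to the Movielens results reported above; the f-measure is taken to be the usual harmonic mean of precision and recall, which the abstract does not spell out.

```python
from typing import Sequence, Set, Tuple


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between true and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def precision_recall_f1(recommended: Set[int], relevant: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 of a recommended item set against the relevant set."""
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall; zero when there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


print(mean_absolute_error([4, 3, 5, 2], [3.5, 3, 4, 3]))  # prints 0.625
print(precision_recall_f1({1, 2, 3, 4}, {2, 3, 5}))
```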
Funding: Supported by the National Natural Science Foundation of China (Project No. 72301293).
Abstract: Target maneuver recognition is a prerequisite for air combat situation awareness, trajectory prediction, threat assessment and maneuver decision. To remove the dependence of current target maneuver recognition methods on empirical criteria and sample data, and to extract target maneuver patterns automatically and adaptively, this paper proposes an air combat maneuver pattern extraction method based on time series segmentation and clustering analysis, combining an autoencoder, the G-G clustering algorithm and a selective ensemble clustering analysis algorithm. Firstly, the autoencoder is used to extract key features of the maneuvering trajectory to remove the impact of redundant variables and reduce the data dimension. Then, taking the time information into account, the maneuver characteristic time series is segmented with the improved FSTS-AEGG algorithm, and a large number of maneuver primitives are extracted. Finally, the maneuver primitives are grouped into categories using the selective ensemble multiple time series clustering algorithm, which shows that each class represents a maneuver action. The maneuver pattern extraction method is applied to small-scale air combat trajectories and can recognize and correctly partition at least 71.3% of maneuver actions, indicating that the method is effective and satisfies the requirements for engineering accuracy. In addition, this method can provide data support for various target maneuver recognition methods proposed in the literature, greatly reducing the workload and improving the recognition accuracy.
Funding: This work was supported by the National Science Foundation of China (62176055) and the China University S&T Innovation Plan Guided by the Ministry of Education.
Abstract: Multi-label learning deals with objects associated with multiple class labels, and aims to induce a predictive model which can assign a set of relevant class labels to an unseen instance. Since each class might possess its own characteristics, the strategy of extracting label-specific features has been widely employed to improve the discrimination process in multi-label learning, where the predictive model is induced based on tailored features specific to each class label instead of identical instance representations. As a representative approach, LIFT generates label-specific features by conducting clustering analysis. However, its performance may be degraded due to the inherent instability of the single clustering algorithm. To improve this, a novel multi-label learning approach named SENCE (stable label-Specific features gENeration for multi-label learning via mixture-based Clustering Ensemble) is proposed, which stabilizes the generation process of label-specific features via clustering ensemble techniques. Specifically, more stable clustering results are obtained by first augmenting the original instance representation with cluster assignments from base clusters and then fitting a mixture model via the expectation-maximization (EM) algorithm. Extensive experiments on eighteen benchmark data sets show that SENCE performs better than LIFT and other well-established multi-label learning algorithms.
Funding: This work was supported in part by the National Natural Science Foundation of China (51679105, 61872160, 51809112) and the "Thirteenth Five Plan" Science and Technology Project of the Education Department, Jilin Province (JJKH20200990KJ).
Abstract: Semi-supervised clustering improves learning performance by using a small number of labeled samples to assist the learning of untagged samples. This paper implements and compares unsupervised and semi-supervised clustering analysis of BOA-Argo ocean text data. Unsupervised K-Means and Affinity Propagation (AP) are two classical clustering algorithms. The Election-AP algorithm is proposed to handle the final cluster number in AP clustering, which has proved difficult to keep within a suitable range. Thermocline data in the BOA-Argo dataset are sampled as semi-supervised samples according to the standard thermocline definition, and these data are used for semi-supervised cluster analysis. Several semi-supervised clustering algorithms were chosen for comparison of learning performance: Constrained-K-Means, Seeded-K-Means, SAP (Semi-supervised Affinity Propagation), LSAP (Loose Seed AP) and CSAP (Compact Seed AP). In order to adapt to the single label, this paper improves the above algorithms to SCKM (improved Constrained-K-Means), SSKM (improved Seeded-K-Means), and SSAP (improved Semi-supervised Affinity Propagation) to perform semi-supervised clustering analysis on the data. A DSAP (Double Seed AP) semi-supervised clustering algorithm based on compact seeds is proposed, and the experimental data show that DSAP has a better clustering effect. The unsupervised and semi-supervised clustering results are used to analyze the potential patterns of marine data.
Abstract: In order to improve the performance and robustness of clustering, it is proposed to generate and aggregate a number of primary clusters via a clustering ensemble technique. Fuzzy clustering ensemble approaches attempt to improve the performance of fuzzy clustering tasks. However, in these approaches, cluster (or clustering) reliability has not received much attention. Ignoring cluster (or clustering) reliability makes these approaches weak in dealing with low-quality base clustering methods. In this paper, we utilize cluster unreliability estimation and a local weighting strategy to propose a new fuzzy clustering ensemble method, which introduces Reliability Based weighted co-association matrix Fuzzy C-Means (RBFCM), Reliability Based Graph Partitioning (RBGP) and Reliability Based Hyper Clustering (RBHC) as three new fuzzy clustering consensus functions. Our fuzzy clustering ensemble approach works based on fuzzy cluster unreliability estimation. Cluster unreliability is estimated according to an entropic criterion using the cluster labels in the entire ensemble. To do so, a new metric is defined to estimate the fuzzy cluster unreliability; then, the reliability value of any cluster is determined using a Reliability Driven Cluster Indicator (RDCI). The time complexities of RBHC and RBGP are linearly proportional to the number of data objects. The performance and robustness of the proposed method are experimentally evaluated on some benchmark datasets. The experimental results demonstrate the efficiency and suitability of the proposed method.
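The idea of an entropic unreliability criterion can be illustrated with a simplified sketch. This is not the paper's fuzzy metric or its RDCI: the abstract gives no formulas, so the code below assumes crisp (hard) base partitions and uses the average Shannon entropy of how a cluster's members are spread over each base partition. The function name `cluster_entropy` and the toy ensemble are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence, Set


def cluster_entropy(members: Set[int], ensemble: Sequence[Sequence[int]]) -> float:
    """Average entropy of how `members` are spread over each base partition.

    A value of 0 means every base partition keeps the members together;
    larger values mean the ensemble disagrees about the cluster.
    """
    total = 0.0
    for partition in ensemble:
        # Count how many members fall into each cluster of this partition.
        counts = Counter(partition[i] for i in members)
        for count in counts.values():
            p = count / len(members)
            total -= p * math.log2(p)
    return total / len(ensemble)


# Two hypothetical crisp base partitions of four objects.
ensemble = [[0, 0, 1, 1], [0, 0, 0, 1]]
print(cluster_entropy({0, 1}, ensemble))  # prints 0.0
print(cluster_entropy({2, 3}, ensemble))  # prints 0.5
```

A local weighting strategy would then down-weight clusters with high entropy when building the consensus, which is the role the abstract assigns to the reliability values.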
Funding: Sponsored by the National Natural Science Foundation of China (Grant Nos. 61101122 and 61071105).
Abstract: With the rapid development of WLAN (Wireless Local Area Network) technology, an important target of indoor positioning systems is to improve the positioning accuracy while reducing the online computation. This paper proposes a novel fingerprint positioning algorithm known as semi-supervised affinity propagation clustering based on distance function constraints. We show that, by employing affinity propagation techniques, a small fraction of labeled data can be used to adjust the similarity matrix of the signal space so as to cluster reference points with high accuracy. The semi-supervised APC uses a combination of machine learning, clustering analysis and a fingerprinting algorithm. Data collected and tests conducted in a realistic indoor WLAN environment indicate that the proposed algorithm can improve positioning accuracy while reducing the online localization computation, as compared with the widely used K nearest neighbor and maximum likelihood estimation algorithms.
Abstract: Clustering categorical data, an integral part of data mining, has attracted much attention recently. In this paper, the authors formally define the categorical data clustering problem as an optimization problem from the viewpoint of cluster ensembles, and apply a cluster ensemble approach to clustering categorical data. Experimental results on real datasets show that better clustering accuracy can be obtained than with existing categorical data clustering algorithms.
Funding: This work was funded by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, Saudi Arabia, under Grant No. (DF-770830-1441). The author, therefore, gratefully acknowledges the technical and financial support from the DSR.
Abstract: This study presents a model of computer-aided intelligence capable of automatically detecting positive COVID-19 instances for use in regular medical applications. The proposed model is based on an Ensemble boosting Neural Network architecture and can automatically detect discriminatory features on chest X-ray images through the Two Step-As clustering algorithm with rich filter families, abstraction and weight-sharing properties. In contrast to the generally used transformational learning approach, the proposed model was trained before and after clustering. The compilation procedure divides the dataset's samples and categories into numerous sub-samples and subcategories and then assigns new group labels to each new group, with each subject group displayed as a distinct category. The retrieved discriminant characteristics of the cases were used to feed the Multiple Neural Network method, which was then utilised to classify the instances. The Two Step-AS clustering method has been modified by pre-aggregating the dataset before applying the Multiple Neural Network algorithm to detect COVID-19 cases from chest X-ray findings. Models for the Multiple Neural Network and Two Step-As clustering algorithms were optimised by utilising the Ensemble Bootstrap Aggregating algorithm to reduce the number of hyperparameters they include. The tests were carried out using the COVID-19 public radiology database, and a cross-validation method ensured accuracy. The proposed classifier, with an accuracy of 98.02%, was found to provide the most efficient outcomes. The result is a low-cost, quick and reliable intelligence tool for detecting COVID-19 infection.
Funding: Supported by the Science and Technology Research Program of Zhejiang Province, China (No. 2011C21036); Projects in Science and Technology of Ningbo Municipal, China (No. 2012B82003); the Shanghai Natural Science Foundation, China (No. 10ZR1400100); and the National Undergraduate Training Programs for Innovation and Entrepreneurship, China (No. 201410876011).
Abstract: A clustering algorithm for semi-supervised affinity propagation based on layered combination is proposed in this paper in light of existing flaws. To improve the accuracy of the algorithm, it introduces the idea of layered combination, divides an affinity propagation clustering (APC) process evenly into several hierarchies, draws samples from the data of each hierarchy according to weight, and executes semi-supervised learning through the construction of pairwise constraints and the use of submanifold label mapping, weighting and combining the clustering results of all hierarchies by combined promotion. Theoretical analysis and experimental results show that the clustering accuracy and computational complexity of the semi-supervised affinity propagation clustering algorithm based on layered combination (SAP-LC algorithm) have been greatly improved.
Abstract: The magnitude and frequency of precipitation are of great significance in the field of hydrologic and hydraulic design and have wide applications in varied areas. However, the availability of precipitation data is limited to a few areas where rain gauges are successfully and efficiently installed. The magnitude and frequency of precipitation at ungauged sites can be assessed by grouping areas with similar characteristics. The procedure of grouping areas with similar behaviour is termed regionalization. In this paper, the RCDA cluster ensemble algorithm is employed to identify the homogeneous regions of rainfall in India. Cluster ensemble methods are commonly used to enhance the quality of clustering by combining multiple clustering schemes to produce a more robust scheme delivering similar homogeneous regions. The goal is to identify, analyse and describe hydrologically similar regions using the RCDA cluster ensemble algorithm, which is based on discriminant analysis. The algorithm takes H base clustering schemes, each with K clusters and obtained by any clustering method, as input and constructs a discriminant function for each of them. Subsequently, cluster membership is predicted for all the data tuples using the H discriminant functions. Tuples with consistent predictions are assigned to the clusters, while tuples with inconsistent predictions are analyzed further and either assigned to clusters or declared as noise. The RCDA algorithm has been compared with the Best of K-means and the Clue cluster ensemble of the R software using traditional clustering quality measures. Further, a domain knowledge based comparison has also been performed. All the results are encouraging and indicate better regionalization of the rainfall in different parts of India.
Funding: This research has been partially supported by the National Natural Science Foundation of China (51175169) and the National Science and Technology Support Program (2012BAF02B01).
Abstract: With a growing number of large-scale data sets, building the similarity matrix required by the affinity propagation (AP) clustering algorithm incurs huge storage and computation costs. Therefore, this paper proposes an improved affinity propagation clustering algorithm. First, subtractive clustering is added, using the density values of the data points to obtain the initial cluster points. Then, the similarity distances between the initial cluster points are calculated and, borrowing the idea of semi-supervised clustering, pairwise constraint information is added to construct a sparse similarity matrix. Finally, AP clustering is run on the cluster representative points until a suitable cluster division is reached. Experimental results show that the algorithm greatly reduces the amount of computation and the storage required for the similarity matrix, and that it outperforms the original algorithm in both clustering quality and processing speed.
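The density values used in the first step can be sketched with the potential measure commonly associated with subtractive clustering, in which each point's density is a sum of Gaussian-like contributions from all points. Whether the paper uses exactly this form is an assumption; the function name `density_values`, the `radius` parameter and the sample points are hypothetical.

```python
import math
from typing import List, Sequence


def density_values(points: Sequence[Sequence[float]], radius: float) -> List[float]:
    """Density of each point: sum over all points of exp(-d^2 / (radius/2)^2).

    Assumed form of the subtractive-clustering density; `radius` controls the
    neighborhood size and is a free parameter of this sketch.
    """
    scale = (radius / 2.0) ** 2
    densities = []
    for p in points:
        total = 0.0
        for q in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
            total += math.exp(-sq_dist / scale)
        densities.append(total)
    return densities


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
dens = density_values(points, radius=1.0)
# The point with the highest density is the first candidate cluster center.
print(max(range(len(dens)), key=dens.__getitem__))  # prints 0
```

Running AP only on a small set of such high-density representatives is what lets the similarity matrix stay sparse and small, as the abstract argues.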