Abstract: One must interact with a specific webpage or website in order to use the Internet for communication, teamwork, and other productive activities. However, because phishing websites look benign and not all website visitors have the same knowledge and skills to inspect the trustworthiness of visited websites, visitors are tricked into disclosing sensitive information, which makes them vulnerable to malicious software attacks such as ransomware. It is impossible to stop attackers from creating phishing websites, which is one of the core challenges in combating them. However, this threat can be alleviated by detecting a specific website as phishing and alerting online users to take the necessary precautions before handing over sensitive information. In this study, five machine learning (ML) and deep learning (DL) algorithms, namely cat-boost (CATB), gradient boost (GB), random forest (RF), multilayer perceptron (MLP), and deep neural network (DNN), were tested with three different reputable datasets and two useful feature selection techniques to assess the scalability and consistency of each classifier’s performance on varied dataset sizes. The experimental findings reveal that the CATB classifier achieved the best accuracy across all datasets (DS-1, DS-2, and DS-3), with respective values of 97.9%, 95.73%, and 98.83%. The GB classifier achieved the second-best accuracy across all datasets (DS-1, DS-2, and DS-3), with respective values of 97.16%, 95.18%, and 98.58%. MLP achieved the best computational time across all datasets (DS-1, DS-2, and DS-3), with respective values of 2, 7, and 3 seconds, despite scoring the lowest accuracy across all datasets.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 51475053).
Abstract: During the condition monitoring of a planetary gearbox, features are extracted from raw data for fault diagnosis. However, different features have different sensitivities for identifying different fault types. Selecting a sensitive feature subset from the entire feature set while retaining as much of the class-discriminatory information as possible therefore has a direct effect on the accuracy of the classification results. In this paper, an improved hybrid feature selection technique (IHFST) that combines a distance evaluation technique (DET), Pearson’s correlation analysis, and an ad hoc technique is proposed. In IHFST, a temporary feature subset without irrelevant features is first selected according to the distance evaluation criterion of DET. Pearson’s correlation analysis and the ad hoc technique are then employed to find and remove, respectively, the redundant features in the temporary feature subset. Hence, a sensitive feature subset without irrelevant or redundant features is selected from the entire feature set. Further, the k-means clustering method is applied to classify the different kinds of health conditions. The effectiveness of the proposed method was validated through several experiments carried out on a planetary gearbox with incipient cracks seeded in the tooth root of the sun gear, planet gear, and ring gear. The results show that the proposed method can successfully distinguish the different health conditions of a planetary gearbox and achieves better classification performance than other methods. This study proposes a sensitive feature subset selection method that achieves an obvious improvement in the accuracy of fault classification.
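The two-stage idea in this abstract (drop irrelevant features by a distance criterion, then drop redundant ones by Pearson correlation) can be illustrated with a minimal sketch. This is not the authors' IHFST: the abstract gives neither the DET formula nor the ad hoc technique, so the distance score, the greedy redundancy rule, the function names, and both thresholds below are assumptions chosen for illustration. The k-means step is omitted.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def distance_score(column, labels):
    """Assumed stand-in for a distance criterion:
    mean gap between class centers / mean spread inside a class."""
    classes = sorted(set(labels))
    if len(classes) < 2:
        return 0.0
    groups = [[v for v, l in zip(column, labels) if l == c] for c in classes]
    centers = [mean(g) for g in groups]
    within = mean(mean(abs(v - c) for v in g) for g, c in zip(groups, centers))
    between = mean(abs(a - b) for i, a in enumerate(centers) for b in centers[i + 1:])
    return between / within if within else float("inf")


def select_features(columns, labels, min_score=1.0, max_corr=0.9):
    """Return sorted indices of features that are relevant and non-redundant.

    min_score and max_corr are hypothetical thresholds, not values from the paper.
    """
    scores = [distance_score(col, labels) for col in columns]
    # Stage 1: discard irrelevant features (low distance score).
    relevant = [i for i, s in enumerate(scores) if s >= min_score]
    # Stage 2: visit the best-scoring features first and skip any feature
    # that is highly correlated with one already kept.
    kept = []
    for i in sorted(relevant, key=lambda i: -scores[i]):
        if all(abs(pearson(columns[i], columns[j])) < max_corr for j in kept):
            kept.append(i)
    return sorted(kept)
```

Each entry of columns holds one feature's values over all samples. A feature that separates the classes survives stage 1, and a near-copy of an already kept feature is removed in stage 2.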
Abstract: The increasing number of security holes in Internet of Things (IoT) networks raises questions about the reliability of existing network intrusion detection systems. This problem has led to the development of a research area focused on improving network-based intrusion detection system (NIDS) technologies. According to the analysis of different businesses, most researchers focus on improving the classification results on NIDS datasets by combining machine learning and feature reduction techniques. However, these techniques are not suitable for every type of network. In light of this, whether the optimal algorithm and feature reduction techniques can be generalized across various datasets for IoT networks remains an open question. The paper aims to analyze the methods used in this research and whether they can be generalized to other datasets. Six ML models were used in this study, namely, logistic regression (LR), decision trees (DT), Naive Bayes (NB), random forest (RF), K-nearest neighbors (KNN), and linear SVM. The primary feature reduction algorithms used in this study, principal component analysis (PCA) and Gini impurity-based weighted random forest (GIWRF), were evaluated against three datasets: ToN-IoT, UNSW-NB15, and Bot-IoT. The optimal number of dimensions for each dataset was not studied when applying the PCA algorithm. It is stated in the paper that the selection of datasets affects the performance of the FE techniques and detection algorithms used. Increasing the efficiency of this research area requires a comprehensive standard feature set that can be used to improve quality over time.
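As background for the Gini-impurity-based technique named above, the following sketch computes the Gini impurity of a label set and the impurity decrease produced by a threshold split, which is the quantity tree-based feature importances accumulate. The abstract does not describe how GIWRF weights these values, so this shows only the underlying measure; the function names are illustrative.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(values, labels, threshold):
    """Drop in Gini impurity when samples are split on value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted
```

A feature whose split cleanly separates normal from attack traffic yields a large decrease, while an uninformative feature yields a decrease near zero.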
Funding: Supported in part by the National Natural Science Foundation of China (Grant Nos. 62076159, 12031010, 61673251, and 61771297), the Fundamental Research Funds for the Central Universities (GK202105003), the Natural Science Basic Research Program of Shaanxi Province of China (2022JM334), and the Innovation Funds of Graduate Programs at Shaanxi Normal University (2015CXS028 and 2016CSY009).
Abstract: Detecting informative features for explainable analysis of high-dimensional data is a significant and challenging task, especially for data with a very small number of samples. Feature selection methods, especially unsupervised ones, are the right way to deal with this challenge. Therefore, two unsupervised spectral feature selection algorithms are proposed in this paper. They group features using an advanced self-tuning spectral clustering algorithm based on local standard deviation, so as to detect the globally optimal feature clusters as far as possible. Two feature ranking techniques, cosine-similarity-based feature ranking and entropy-based feature ranking, are then proposed, so that the representative feature of each cluster can be detected to form the feature subset on which the explainable classification system will be built. The effectiveness of the proposed algorithms is tested on high-dimensional benchmark omics datasets and compared to peer methods, and statistical tests are conducted to determine whether the proposed spectral feature selection algorithms differ significantly from the peer methods. The extensive experiments demonstrate that the proposed unsupervised spectral feature selection algorithms outperform the peer methods, especially the one based on the cosine-similarity feature ranking technique. The statistical test results show that the spectral feature selection algorithm based on entropy feature ranking performs best. The detected features demonstrate strong discriminative capabilities in downstream classifiers for omics data, such that an AI system built on them would be reliable and explainable. This is especially significant for building transparent and trustworthy medical diagnostic systems from an interpretable AI perspective.
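The step of picking one representative feature per cluster can be sketched as follows. The abstract does not give the ranking formula, so the rule used here (choose the feature with the highest average absolute cosine similarity to the other features in its cluster) is an assumption, as are the function names. The clusters are taken as given rather than computed by spectral clustering.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def representatives(features, clusters):
    """Pick one feature index per cluster.

    features: list of feature vectors (one value per sample).
    clusters: list of lists of feature indices, assumed to come from a
              prior clustering step.
    """
    chosen = []
    for members in clusters:
        if len(members) == 1:
            chosen.append(members[0])
            continue

        def avg_similarity(i, members=members):
            # Assumed ranking score: mean |cosine| to the other cluster members.
            others = [j for j in members if j != i]
            return sum(abs(cosine(features[i], features[j])) for j in others) / len(others)

        chosen.append(max(members, key=avg_similarity))
    return chosen
```

The returned indices form the reduced feature subset that a downstream classifier would be trained on.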
Funding: The FIST project of DST, Government of India, sponsored the work on web engineering and cloud-based computing.
Abstract: System analysts often use software fault prediction models to identify fault-prone modules during the design phase of the software development life cycle. The models help predict faulty modules based on the software metrics that are input to the models. In this study, we consider 20 types of metrics to develop a model using an extreme learning machine associated with various kernel methods. We evaluate the effectiveness of the model using a proposed framework based on the cost and efficiency in the testing phases. The evaluation process is carried out by considering case studies for 30 object-oriented software systems. Experimental results demonstrate that the application of a fault prediction model is suitable for projects with the percentage of faulty classes below a certain threshold, which depends on the efficiency of fault identification (low: 47.28%; median: 39.24%; high: 25.72%). We consider nine feature selection techniques to remove the irrelevant metrics and to select the best set of source code metrics for fault prediction.
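The threshold rule reported in this abstract can be written as a small check. The three threshold values are taken from the abstract; the function name, its parameters, and the treatment of a project exactly at the threshold as unsuitable are assumptions made for illustration.

```python
# Maximum percentage of faulty classes for which fault prediction is reported
# as suitable, keyed by the efficiency of fault identification.
FAULTY_CLASS_THRESHOLDS = {"low": 47.28, "median": 39.24, "high": 25.72}


def prediction_is_suitable(faulty_classes, total_classes, efficiency):
    """Return True if the project's faulty-class percentage is below the threshold."""
    if total_classes <= 0:
        raise ValueError("total_classes must be positive")
    percent_faulty = 100.0 * faulty_classes / total_classes
    return percent_faulty < FAULTY_CLASS_THRESHOLDS[efficiency]
```

For example, a project with 30 faulty classes out of 100 falls below the thresholds for low and median efficiency but above the threshold for high efficiency.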