Abstract: The expansion of internet-connected services has increased cyberattacks, many of which have grave and disastrous repercussions. An Intrusion Detection System (IDS) plays an essential role in network security because it helps protect the network from vulnerabilities and attacks. Although extensive research has been reported on IDS, detecting novel intrusions with optimal features and reducing false alarm rates remain challenging. Therefore, we developed a novel fusion-based feature importance method to reduce the high-dimensional feature space, which helps to identify attacks accurately with a lower false alarm rate. Initially, various preprocessing techniques are applied to improve training data quality. The Adaptive Synthetic oversampling technique generates synthetic samples for minority classes. In the proposed fusion-based feature importance, we use different approaches from the filter, wrapper, and embedded methods, such as mutual information, random forest importance, permutation importance, and SHapley Additive exPlanations (SHAP)-based feature importance, together with statistical feature importance methods such as the difference of mean and median and the standard deviation, to rank each feature according to its importance. The optimal features are then retrieved by simple plurality voting and fed to various models: Extra Tree (ET), Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), and Extreme Gradient Boosting Machine (XGBM). The hyperparameters of the classification models are tuned with Halving Random Search cross-validation to enhance performance. The experiments were carried out on the original imbalanced data and on balanced data. The outcomes demonstrate that the balanced data scenario outperformed the imbalanced one. Finally, the experimental analysis showed that the proposed fusion-based feature importance performed well with XGBM, giving accuracies of 99.86%, 99.68%, and 92.4% with 9, 7, and 8 features and training times of 1.5, 4.5, and 5.5 s on the Network Security Laboratory-Knowledge Discovery in Databases (NSL-KDD), Canadian Institute for Cybersecurity (CIC-IDS 2017), and UNSW-NB15 datasets, respectively. In addition, the suggested technique has been examined and compared with state-of-the-art methods on the three datasets.
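The plurality-voting step in this abstract can be illustrated with a minimal sketch. The abstract does not say how many features each ranker nominates or how ties are resolved, so the parameters `k` and `n_select`, the alphabetical tie-break, and all importance scores below are assumptions made up for illustration; this is not the authors' implementation.

```python
from collections import Counter

def top_k(scores, k):
    """Return the k feature names with the highest importance score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def plurality_vote(rankers, k, n_select):
    """Each ranker votes for its own top-k features; keep the n_select
    features that collect the most votes (ties broken by name)."""
    votes = Counter()
    for scores in rankers.values():
        votes.update(top_k(scores, k))
    ordered = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ordered[:n_select]]

# Hypothetical importance scores from three rankers over four features.
rankers = {
    "mutual_info": {"duration": 0.9, "src_bytes": 0.7, "flag": 0.2, "urgent": 0.1},
    "random_forest": {"duration": 0.4, "src_bytes": 0.8, "flag": 0.6, "urgent": 0.0},
    "permutation": {"duration": 0.5, "src_bytes": 0.3, "flag": 0.1, "urgent": 0.2},
}
selected = plurality_vote(rankers, k=2, n_select=2)
print(selected)  # ['src_bytes', 'duration']
```

Voting on each ranker's top-k list, rather than averaging raw scores, sidesteps the fact that the importance measures are on different scales.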
Funding: Supported by the National Natural Science Foundation of China (71001072) and the Natural Science Foundation of Guangdong Province (9451806001002294).
Abstract: A fast feature ranking algorithm for classification in the presence of high dimensionality and small sample size is proposed. The basic idea is that the important features force data points of the same class to maintain their intrinsic neighbor relations, whereas neighboring points of different classes no longer stick to one another. Applying this assumption, an optimization problem that weights each feature is derived. The algorithm does not involve dense matrix eigen-decomposition, which can be computationally expensive. Extensive experiments are conducted to validate the significance of the selected features using the Yale, Extended YaleB, and PIE datasets. The evaluation shows that, with a one-nearest-neighbor classifier, the recognition rates obtained with the 100 to 500 leading features selected by the algorithm clearly outperform those obtained with features selected by the baseline feature selection algorithms, whereas with a support vector machine the features selected by the algorithm show a less prominent improvement. Moreover, the experiments demonstrate that the proposed algorithm is particularly efficient for multi-class face recognition problems.
Abstract: In this paper, a feature selection method combining the ReliefF and SVM-RFE algorithms is proposed. The algorithm integrates the weight vector from ReliefF into the SVM-RFE method. In the first stage, ReliefF filters out many noisy features. A new ranking criterion based on the SVM-RFE method is then applied to obtain the final feature subset. An SVM classifier is used to evaluate the final image classification accuracy. Experimental results show that the proposed relief-SVM-RFE algorithm can achieve significant improvements for feature selection in image classification.
Funding: Supported by the National Natural Science Foundation of China (61175068).
Abstract: The quality of expert ranking directly affects expert retrieval precision. Based on the characteristics of the expert entity, a listwise expert ranking model with multiple features is proposed. First, multiple features are selected through the analysis of expert pages; second, all features are integrated into the ListNet ranking model, whose parameters are learned through gradient descent to construct the expert ranking model; finally, comparative expert ranking experiments are performed using the trained model. The experimental results show that the proposed method is effective, and the value of NDCG@1 increased by 14.2% compared with the pairwise expert ranking method.
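As a rough illustration of the listwise objective that ListNet minimizes, the sketch below computes the cross-entropy between the top-one probability distributions of the ground-truth scores and the model scores for a single query. The relevance labels and model scores are hypothetical, and the feature integration and gradient-descent training described in the abstract are not shown.

```python
import math

def softmax(scores):
    """Top-one probabilities: P(j) = exp(s_j) / sum_k exp(s_k)."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def listnet_loss(true_scores, predicted_scores):
    """Cross-entropy between the top-one distributions of the labels
    and of the model scores for one query's candidate list."""
    p_true = softmax(true_scores)
    p_pred = softmax(predicted_scores)
    return -sum(pt * math.log(pp) for pt, pp in zip(p_true, p_pred))

# Hypothetical relevance labels and model scores for three experts.
labels = [2.0, 1.0, 0.0]
good = listnet_loss(labels, [1.8, 0.9, 0.1])  # ordering agrees with labels
bad = listnet_loss(labels, [0.1, 0.9, 1.8])   # ordering is reversed
print(good < bad)
```

The loss is defined over the whole candidate list at once, which is what distinguishes a listwise model from the pairwise baseline mentioned in the abstract.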
Abstract: Data size plays a significant role in the design and performance of data mining models. A good feature selection algorithm reduces the problems of large data size and of noise due to data redundancy. Feature selection algorithms aim to select the best features and eliminate unnecessary ones, which in turn simplifies the structure of the data mining model and increases its performance. This paper introduces a robust feature selection algorithm named the Features Ranking Voting (FRV) algorithm. It merges the benefits of different feature selection algorithms to determine the feature ranks in a dataset correctly and robustly, based on feature ranks and a voting algorithm. FRV comprises three proposed techniques for selecting the minimum best feature set: a forward voting technique, which selects the highest-ranked features; a backward voting technique, which drops the lowest-ranked (low-importance) features; and a third technique, which merges the outputs of the forward and backward techniques to maximize the robustness of the selected feature set. To evaluate the proposed FRV, different data mining models were built using the feature sets it selected on different datasets. The high performance of these models reflects the success of the proposed FRV algorithm. FRV is also compared with other feature selection algorithms. It yields data mining models with an accuracy of 96.8% on the Hungarian CAD dataset and 96% on the Z-Alizadeh Sani CAD dataset, compared with 83.94% and 92.56%, respectively, in [48].
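One possible reading of the forward, backward, and merged voting techniques is sketched below. The abstract does not specify the voting thresholds or the merge rule, so the majority threshold, the cut-off `k`, the intersection-based merge, and the feature names are all assumptions for illustration, not the FRV algorithm itself.

```python
from collections import Counter

def majority(rankings, pick):
    """Features chosen by `pick` in more than half of the rankings."""
    votes = Counter(f for ranking in rankings for f in pick(ranking))
    return {f for f, n in votes.items() if n > len(rankings) / 2}

def forward_vote(rankings, k):
    """Keep features that most rankings place in their top k."""
    return majority(rankings, lambda r: r[:k])

def backward_vote(rankings, k):
    """Drop features that most rankings place in their bottom k."""
    everything = set(rankings[0])
    return everything - majority(rankings, lambda r: r[-k:])

# Hypothetical rankings (best first) from three selection algorithms.
rankings = [
    ["age", "chol", "bp", "sex", "fbs"],
    ["chol", "age", "sex", "bp", "fbs"],
    ["age", "bp", "chol", "fbs", "sex"],
]
forward = forward_vote(rankings, k=2)
backward = backward_vote(rankings, k=2)
merged = forward & backward  # assumed merge rule: intersection
print(sorted(forward), sorted(backward), sorted(merged))
```

With these inputs the forward pass keeps `age` and `chol`, the backward pass drops `sex` and `fbs`, and the intersection keeps the two features that both passes agree on.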
Funding: Supported by the European Union through the IST-034601 edutain@grid project.
Abstract: Accurate performance prediction of Grid workflow activities can help Grid schedulers map activities to appropriate Grid sites. This paper describes an approach based on a features-ranked RBF neural network to predict the performance of Grid workflow activities. Experimental results for two kinds of real-world Grid workflow activities are presented to show the effectiveness of the approach.
Funding: Supported by the National Natural Science Foundation of China (90204008) and the Chen-Guang Plan of Wuhan City (20055003059-3).
Abstract: Feature selection is a preprocessing step in data mining, and heuristic search algorithms are often used for it. Many heuristic search algorithms are based on discernibility matrices, which consider only the differences in an information system. Because similar characteristics are not revealed in a discernibility matrix, the resulting rules may not be the simplest. Although difference-similitude (DS) methods take both difference and similitude into account, the existing search strategy causes some important features to be ignored. An improved DS-based algorithm is proposed in this paper to solve this problem. The improved algorithm defines an attribute rank function that considers both difference and similitude in feature selection. Experiments show that it is an effective algorithm, especially for large-scale databases. The time complexity of the algorithm is O(|C|^2 |U|^2).
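For context, a discernibility matrix in the usual rough-set sense can be built as in the sketch below: for every pair of objects with different decisions, it records the condition attributes on which they differ. The toy decision table is hypothetical, and the similitude part and the attribute rank function of the improved DS algorithm are not shown, since the abstract does not define them.

```python
from itertools import combinations

def discernibility_matrix(rows, decisions):
    """For each pair of objects with different decisions, record the
    set of condition attributes on which the two objects differ."""
    matrix = {}
    for i, j in combinations(range(len(rows)), 2):
        if decisions[i] != decisions[j]:
            matrix[(i, j)] = {a for a in rows[i] if rows[i][a] != rows[j][a]}
    return matrix

# Hypothetical decision table: three objects, two condition attributes.
rows = [
    {"colour": "red", "size": "big"},
    {"colour": "red", "size": "small"},
    {"colour": "blue", "size": "small"},
]
decisions = ["yes", "no", "no"]
matrix = discernibility_matrix(rows, decisions)
print(matrix)
```

Objects 1 and 2 share a decision, so the matrix has no entry for them; what they have in common is exactly the information that a purely difference-based method leaves out.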
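Abstract: To address problems specific to potential high-value passengers, such as highly imbalanced data and the weak correlation between passenger features and value categories, a potential high-value passenger discovery model based on triple hybrid sampling and ensemble learning is proposed. The RFM (Recency, Frequency, Monetary) method is used to label passenger categories; triple hybrid sampling is used to resample the imbalanced passenger dataset; a fusion feature selection algorithm is used to select passenger features; and a gradient boosting decision tree is used as the classifier to build a passenger value prediction model and identify potential high-value passengers. Experimental results on the PNR dataset show that, compared with baseline algorithms, the model achieves better AUC and F1 values and identifies potential high-value passengers well.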
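The RFM labelling step might look like the following minimal sketch. The abstract does not give the scoring rule, so the all-thresholds-must-pass rule, the threshold values, and the passenger records here are assumptions for illustration only.

```python
from datetime import date

def rfm_label(last_trip, trips, spend, today, thresholds):
    """Label a passenger high-value when recency, frequency, and monetary
    value all pass their (hypothetical) thresholds."""
    recency_days = (today - last_trip).days
    passes = (
        recency_days <= thresholds["max_recency_days"],
        trips >= thresholds["min_trips"],
        spend >= thresholds["min_spend"],
    )
    return "high" if all(passes) else "ordinary"

# Hypothetical thresholds and passengers, made up for illustration.
thresholds = {"max_recency_days": 90, "min_trips": 5, "min_spend": 1000.0}
today = date(2024, 1, 1)
recent = rfm_label(date(2023, 12, 1), 8, 2500.0, today, thresholds)
lapsed = rfm_label(date(2023, 3, 1), 8, 2500.0, today, thresholds)
print(recent, lapsed)  # high ordinary
```

Labels produced this way are typically rare for the high-value class, which is consistent with the class imbalance the abstract sets out to handle by resampling.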
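Abstract: Selecting review experts is a key step in technical review. Given the requirements for timeliness and intelligence placed on the forecasting system of the disruptive-technology expert forecasting platform, expert selection has a decisive influence on the forecasting results. Selecting qualified experts through academic-expertise matching and professional screening can reduce cost, improve recommendation efficiency and accuracy, and accomplish the task of forecasting disruptive technologies. Methods based on academic network representation learning avoid extensive feature engineering and make it easy to fuse different types of features. An expert database is built using heterogeneous network representation learning and a label-ranking method for profiling academic expertise, and a matching method that incorporates the experts' comprehensive evaluation index features is used to match the disruptive technology to be forecast against expert expertise, providing a list of candidate experts with matching professional backgrounds for expert selection. The method is evaluated in simulation experiments on the Academic Social Network dataset. The results show that the method improves the matching of academic expertise for project review experts; once the comprehensive index features are added, they are effectively reflected in the experimental results, improving the timeliness and intelligence of the review system.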
Funding: Supported by a grant of the Ministry of Research, Innovation and Digitization, CNCS-UEFISCDI, Project Number PN-III-P4-PCE-2021-0334, within PNCDI III.
Abstract: The potential of text analytics is revealed by Machine Learning (ML) and Natural Language Processing (NLP) techniques. In this paper, we propose an NLP framework that is applied to multiple datasets to detect malicious Uniform Resource Locators (URLs). The proposed framework includes three categories of features, both ML and Deep Learning (DL) algorithms, and a ranking schema. To extract features from text, we apply frequency-based embeddings, such as the hash vectorizer and Term Frequency-Inverse Document Frequency (TF-IDF), and prediction-based embeddings, such as word2vec (continuous bag of words and skip-gram) from Google. Further, we apply more state-of-the-art methods, such as GloVe, to create vectorized features. Additionally, feature engineering specific to URL structure is deployed to detect scams and other threats. For framework assessment, four ranking indicators are weighted: computational time, and performance measured as accuracy, F1 score, and Type II error. For the computational time, we propose a new metric, Feature Building Time (FBT), because cutting-edge feature builders (such as doc2vec or GloVe) require more time. Under the proposed assessment step, the skip-gram algorithm of word2vec surpasses the other feature builders in performance. Additionally, eXtreme Gradient Boost (XGB) outperforms the other classifiers. With this setup, we attain an accuracy of 99.5% and an F1 score of 0.99.
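To make the idea of timing a feature builder concrete, the sketch below hashes URL tokens into a fixed-length count vector and measures how long the build takes. The abstract does not define how FBT is measured, so the wall-clock timing, the tokenizer, the bucket count, and the example URLs are all assumptions for illustration; it is not the framework's own vectorizer.

```python
import re
import time
import zlib

def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

def hash_vectorize(url, n_buckets=16):
    """Count tokens into a fixed number of buckets (the hashing trick).
    zlib.crc32 is used because it is stable across Python runs."""
    vector = [0] * n_buckets
    for token in url_tokens(url):
        vector[zlib.crc32(token.encode("utf-8")) % n_buckets] += 1
    return vector

def timed_build(builder, urls):
    """Return the built features and the elapsed wall-clock seconds."""
    start = time.perf_counter()
    features = [builder(u) for u in urls]
    return features, time.perf_counter() - start

# Hypothetical URLs, made up for illustration only.
urls = ["http://example.com/login", "http://example.org/free-prize/claim.php"]
features, seconds = timed_build(hash_vectorize, urls)
print(len(features[0]), sum(features[0]), sum(features[1]))  # 16 4 7
```

Passing a different builder to `timed_build` gives a like-for-like timing, which is the kind of comparison a build-time metric is meant to support.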