Abstract: In recent years, the detection of fake job descriptions has become increasingly necessary because social networking has changed the way people access the growing volume of information in the internet age. Identifying fraud in job descriptions can help jobseekers avoid many of the risks of job hunting. However, detecting fake job descriptions runs into the problem of class imbalance, because the number of genuine jobs exceeds the number of fake jobs. This reduces the predictive performance of traditional machine learning models. We therefore present an efficient oversampling-based framework called FJD-OT (Fake Job Description Detection Using Oversampling Techniques) to improve the predictability of detecting fake job descriptions. In the first module of the proposed framework, we preprocess the text data with several techniques, including stop-word removal and tokenization. In the second module, we use a bag of words in combination with the term frequency-inverse document frequency (TF-IDF) approach to extract features from the text and build the feature dataset. Next, our framework applies k-fold cross-validation, a commonly used technique for testing the effectiveness of machine learning models, which splits the experimental dataset [the Employment Scam Aegean (ESA) dataset in our study] into training and test sets for evaluation. The training set is passed through the third module, an oversampling module in which the SVMSMOTE method balances the data before the classifiers are trained in the last module. The experimental results indicate that the proposed approach significantly improves the predictability of fake job description detection on the ESA dataset according to several popular performance metrics.
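The preprocessing and feature-extraction modules described above (stop-word removal, tokenization, bag of words, TF-IDF) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the stop-word list, the sample postings, and the function names `tokenize` and `tfidf` are assumptions made for illustration, and the weighting used here, term count times ln(N/df), is only one of several common TF-IDF variants.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "and", "or", "to", "of", "for", "in", "is", "we"}


def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i].

    Weight = raw term count * ln(n_docs / document_frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(t for toks in tokenized for t in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # bag of words for this document
        rows.append([counts[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows


# Hypothetical job postings, invented for this example.
docs = [
    "Earn money fast, no experience needed",
    "Software engineer needed for the backend team",
    "Earn money from home, wire the fee to apply",
]
vocab, rows = tfidf(docs)
```

Terms that occur in every document receive weight zero under this variant, while terms unique to one posting receive the largest weight; the resulting rows are the kind of feature vectors an oversampler and classifier would then consume.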
Funding: This research was supported by Taif University Researchers Supporting Project number (TURSP-2020/254), Taif University, Taif, Saudi Arabia.
Abstract: Datasets with an imbalanced class distribution are difficult to handle with standard classification algorithms. In supervised learning, class imbalance is still considered a challenging research problem. Many machine learning techniques are designed to operate on balanced datasets; therefore, various undersampling, oversampling and hybrid strategies have been proposed in the state of the art to deal with imbalanced datasets, but highly skewed datasets still pose problems of generalization and of noise generation during resampling. To overcome these problems, this paper proposes a majority clustering model for the classification of imbalanced datasets, known as MCBC-SMOTE (Majority Clustering for Balanced Classification-SMOTE). The model provides a method to convert a binary classification problem into a multi-class problem. In the proposed algorithm, the number of clusters for the majority class is calculated using the elbow method, and the minority class is oversampled to the average size of the clustered majority classes to generate a symmetrical class distribution. The proposed technique is cost-effective, reduces noise generation and successfully removes the imbalances present between and within classes. Evaluations on diverse real datasets show better classification results than existing state-of-the-art methods on several performance metrics.
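One possible reading of the binary-to-multi-class conversion described above can be sketched as follows. This is an interpretation of the abstract, not the published MCBC-SMOTE algorithm: the cluster assignments are assumed to come from some clustering step (with the number of clusters chosen by the elbow method), the function name `relabel_and_target` and the label strings are hypothetical, and the oversampling itself is not shown.

```python
from collections import Counter


def relabel_and_target(labels, cluster_ids, minority_label=1):
    """Turn a binary problem into a multi-class one and pick a minority target size.

    labels[i]      original binary label of sample i
    cluster_ids[i] cluster index of sample i if it is a majority sample
                   (ignored for minority samples)
    Returns (new_labels, target), where target is the mean majority-cluster
    size, i.e. the size the minority class would be oversampled to.
    """
    new_labels = [
        "minority" if y == minority_label else f"majority_{c}"
        for y, c in zip(labels, cluster_ids)
    ]
    sizes = Counter(lab for lab in new_labels if lab != "minority")
    target = round(sum(sizes.values()) / len(sizes))
    return new_labels, target


# Hypothetical toy data: 9 majority samples in 3 clusters, 2 minority samples.
labels = [0] * 9 + [1] * 2
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 2, 2, None, None]
new_labels, target = relabel_and_target(labels, cluster_ids)
```

In this toy case the majority clusters have sizes 4, 3 and 2, so the two minority samples would be oversampled toward three, which leaves four classes of comparable size instead of a 9-to-2 binary split.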
Abstract: Recently, Financial Technology (FinTech) has received growing attention from financial sectors and researchers seeking effective solutions for financial institutions and firms. Financial crisis prediction (FCP) is an essential topic in the business sector, where it is used to identify the financial condition of a financial institution. At the same time, the development of the internet of things (IoT) has altered the way humans interact with the physical world. The IoT can be combined with an FCP model to examine users' financial data and support decision making. This paper presents a novel multi-objective squirrel search optimization algorithm with stacked autoencoder (MOSSA-SAE) model for FCP in an IoT environment. The MOSSA-SAE model comprises several subprocesses, namely preprocessing, class imbalance handling, parameter tuning, and classification. First, the MOSSA-SAE model allows IoT devices such as smartphones and laptops to collect users' financial details, which are then transmitted to the cloud for further analysis. In addition, the SMOTE technique is employed to handle class imbalance. Within SMOTE, the role of MOSSA is to determine the oversampling rate and the area of nearest neighbors. The SAE model is then used as the classification technique to determine the class label of the financial data, and MOSSA is also applied to select appropriate 'weights' and 'bias' values for the SAE. An extensive experimental validation is performed on a benchmark financial dataset, and the results are examined under distinct aspects. The experimental results indicate the superior performance of the MOSSA-SAE model on the applied dataset.
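The two SMOTE settings that the abstract says MOSSA tunes, the oversampling rate and the nearest-neighbor area, can be made concrete with a minimal standard-library sketch of basic SMOTE interpolation. This is not the paper's code: the function name `smote`, its parameter names `rate` and `k`, and the toy data are assumptions, and the optimizer that would choose `rate` and `k` is not shown.

```python
import math
import random


def smote(minority, rate, k, seed=0):
    """Create int(rate * len(minority)) synthetic minority points.

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(int(rate * len(minority))):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of `base` inside the minority class, excluding itself.
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, rate=2.0, k=2)
```

A larger `rate` produces more synthetic points, while a larger `k` lets interpolation reach more distant neighbours; with `k=2` on this square, every synthetic point falls on an edge between adjacent corners and never on a diagonal.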
Funding: Supported by the National Natural Science Foundation of China (Grant No. 71873128).
Abstract: Achieving a higher true positive rate while decreasing the false positive rate is always a great challenge for the imbalance learning community. This work combines the penalized empirical likelihood method, the lower bound algorithm and the Nyström method, and applies these techniques together with the kernel method to the density ratio model. The resulting classifier, the density ratio classifier (DRC), combines kernelization, regularization, efficient implementation and threshold moving, all of which are critical in making DRC an effective and powerful method for solving difficult imbalance problems. Compared with other methods, DRC is competitive in that it is widely applicable and is simple and easy to use without additional imbalance-handling skills. In addition, the convergence rate of the estimate of the log density ratio is discussed. The results of numerical analysis also show that DRC outperforms other methods in AUC and G-mean score.
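Threshold moving and the G-mean score mentioned above can be illustrated with a short standard-library sketch. It shows generic threshold moving, not the DRC procedure itself: the scores are assumed to come from some already-fitted classifier, and the function names `g_mean` and `best_threshold`, the selection rule (maximize G-mean over the observed scores), and the toy data are assumptions.

```python
import math


def g_mean(y_true, scores, threshold):
    """Geometric mean of true positive rate and true negative rate."""
    tp = fn = tn = fp = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if y == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    return math.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))


def best_threshold(y_true, scores):
    """Pick the cut-off, among the observed scores, that maximizes G-mean."""
    return max(sorted(set(scores)), key=lambda t: g_mean(y_true, scores, t))


# Hypothetical imbalanced validation set: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.40, 0.35, 0.60]
threshold = best_threshold(y_true, scores)
```

On this toy data the default cut-off of 0.5 misses one of the two positives, whereas the moved threshold recovers both at the cost of a single false positive, which is the trade-off that G-mean rewards.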
Funding: This research is supported by the Key Project of the National Natural Science Foundation of China under Grant No. 10631070, the National Natural Science Foundation of China under Grant Nos. 10801112, 10971223 and 11071252, and the Ph.D. Graduate Start Research Foundation of Xinjiang University Funded Project under Grant No. BS080101. The authors thank Dr. Yong Wang of the Institute of Systems Science, Academy of Mathematics and Systems Science, for kind discussion and good suggestions.
Abstract: This paper develops sequence-based methods for identifying novel protein-protein interactions (PPIs) by means of support vector machines (SVMs). The authors encode proteins not only at the gene level but also at the amino acid level, and design a procedure for selecting the negative training set to deal with the training dataset imbalance problem, i.e., the number of interacting protein pairs is small relative to the large number of non-interacting protein pairs. The proposed methods are validated on PPI data of Plasmodium falciparum and Escherichia coli, and yield predictive accuracies of 93.8% and 95.3%, respectively. The functional annotation analysis and database search indicate that the novel predictions are worthy of future experimental validation. The new methods will be useful supplementary tools for future proteomics studies.