Abstract: Purpose: This study is a comprehensive review of the existing annotated corpora for event extraction, which are limited but essential for training and improving event extraction algorithms. In addition to this primary goal, it provides guidelines for preparing an annotated corpus and suggests suitable tools for the annotation task. Design/methodology/approach: This study employs an analytical approach to examine the available corpora that are suitable for event extraction tasks. It offers an in-depth analysis of existing event extraction corpora and provides systematic guidelines for researchers to develop accurate, high-quality corpora. This ensures the reliability of the created corpus and its suitability for training machine learning algorithms. Findings: Our exploration reveals a scarcity of annotated corpora for event extraction tasks. In particular, the English corpora are mainly focused on the biomedical and general domains. Despite this scarcity, several high-quality corpora are available and widely used as benchmark datasets. However, access to some of these corpora may be limited owing to closed-access policies, or because maintenance was discontinued after the initial release and the links are now broken. Therefore, this study documents the available corpora for event extraction tasks. Research limitations: Our study focuses only on well-known corpora available in English and Chinese. It places a strong emphasis on the English corpora because English is a global lingua franca and is more widely understood than other languages. Practical implications: We believe that this study provides valuable knowledge that can serve as a guiding framework for preparing and accurately annotating events from text corpora. It provides comprehensive guidelines for researchers to improve the quality of corpus annotations, especially for event extraction tasks across various domains. Originality/value: This study comprehensively compiles information on the existing annotated corpora for event extraction tasks and provides preparation guidelines.
Funding: This research was supported by Researchers Supporting Project Number (RSP2021/309), King Saud University, Riyadh, Saudi Arabia. The authors wish to acknowledge Yayasan Universiti Teknologi Petronas for supporting this work through the research grant (015LC0-308).
Abstract: Machine learning (ML) practices such as classification have played a very important role in classifying diseases in medical science. Since medical science is a sensitive field, the pre-processing of medical data requires careful handling to support quality clinical decisions. Medical data is generally high-dimensional and complex, and it contains many irrelevant and redundant features. These factors indirectly degrade the disease prediction and classification accuracy of any ML model. To address this issue, various data pre-processing methods called Feature Selection (FS) techniques have been presented in the literature. However, the majority of such techniques frequently suffer from local minima issues because of the large solution space. Thus, this study proposes a novel wrapper-based Sand Cat Swarm Optimization (SCSO) technique as an FS approach to find optimum features from ten benchmark medical datasets. The SCSO algorithm replicates the hunting and searching strategies of the sand cat, with the advantage of avoiding local optima and finding the ideal solution with minimal control variables. Moreover, a K-Nearest Neighbor (KNN) classifier was used to evaluate the effectiveness of the features identified by the proposed SCSO algorithm. The performance of the proposed SCSO algorithm was compared with six state-of-the-art and recent wrapper-based optimization algorithms using the validation metrics of classification accuracy, optimum feature size, and computational cost in seconds. The simulation results on the benchmark medical datasets revealed that the proposed SCSO-KNN approach outperformed the comparative algorithms with an average classification accuracy of 93.96% by selecting 14.2 features within 1.91 s. Additionally, the Wilcoxon rank test was used to perform the significance analysis between the proposed SCSO-KNN method and the six other algorithms at a significance threshold of p < 5.00E-02. The findings revealed that the proposed algorithm produces better outcomes with an average p-value of 1.82E-02. Moreover, potential future directions are suggested in light of the study’s promising findings.
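The wrapper idea in the SCSO-KNN abstract above can be illustrated with a minimal sketch: a binary mask picks a feature subset, a KNN classifier scores it, and a fitness value trades error against subset size. The abstract does not state the fitness function that SCSO optimizes, so the weighting `alpha`, the leave-one-out evaluation, the default `k=1`, and both function names are assumptions made for illustration.

```python
import math

def knn_loo_error(X, y, mask, k=1):
    """Leave-one-out error rate of a k-NN classifier restricted to masked features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 1.0  # an empty subset cannot classify anything
    errors = 0
    for i, xi in enumerate(X):
        # Distances from sample i to every other sample, using selected columns only.
        dists = sorted(
            (math.dist([xi[j] for j in cols], [xo[j] for j in cols]), y[o])
            for o, xo in enumerate(X) if o != i
        )
        votes = [label for _, label in dists[:k]]
        predicted = max(set(votes), key=votes.count)
        errors += predicted != y[i]
    return errors / len(X)

def wrapper_fitness(X, y, mask, alpha=0.99):
    """Lower is better: weighted sum of error rate and fraction of features kept."""
    return alpha * knn_loo_error(X, y, mask) + (1 - alpha) * sum(mask) / len(mask)

# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 5.2], [1.1, 0.8], [1.2, 9.1]]
y = [0, 0, 0, 1, 1, 1]
print(wrapper_fitness(X, y, [1, 0]))  # informative feature only
print(wrapper_fitness(X, y, [1, 1]))  # both features
```

In a wrapper-based optimizer, each candidate solution would be decoded into such a mask and ranked by this kind of fitness value; the search strategy itself (here, SCSO) is independent of the scoring function.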
Abstract: Clustering high-dimensional data is challenging because increasing dimensionality increases the distance between data points, resulting in sparse regions that degrade clustering performance. Subspace clustering is a common approach for processing high-dimensional data by finding relevant features for each cluster in the data space. Subspace clustering methods extend traditional clustering to account for the constraints imposed by data streams. Data streams are not only high-dimensional but also unbounded and evolving. This necessitates the development of subspace clustering algorithms that can handle high dimensionality and adapt to the unique characteristics of data streams. Although many articles have contributed to the literature review on data stream clustering, there is currently no specific review on subspace clustering algorithms in high-dimensional data streams. Therefore, this article aims to systematically review the existing literature on subspace clustering of data streams in high-dimensional streaming environments. The review follows a systematic methodological approach and includes 18 articles in the final analysis. The analysis focuses on two research questions, one on the general clustering process and one on dealing with the unbounded and evolving characteristics of data streams. The main findings relate to six elements: clustering process, cluster search, subspace search, synopsis structure, cluster maintenance, and evaluation measures. Most algorithms use a two-phase clustering approach consisting of an initialization stage, a refinement stage, a cluster maintenance stage, and a final clustering stage. The density-based top-down subspace clustering approach is more widely used than the others because it is able to distinguish true clusters and outliers using projected microclusters. Most algorithms implicitly adapt to the evolving nature of the data stream by using a time fading function that is sensitive to outliers. Future work can focus on the clustering framework, parameter optimization, subspace search techniques, memory-efficient synopsis structures, explicit cluster change detection, and intrinsic performance metrics. This article can serve as a guide for researchers interested in high-dimensional subspace clustering methods for data streams.
Funding: This research was supported by Universiti Teknologi PETRONAS, under the Yayasan Universiti Teknologi PETRONAS (YUTP) Fundamental Research Grant Scheme (YUTPFRG/015LC0-308).
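The time fading function mentioned in the subspace clustering abstract above can be sketched as a microcluster summary whose statistics decay with elapsed time. The exponential form 2^(-lam * elapsed) is a common choice in stream clustering and is an assumption here, as are the class name `FadingMicroCluster` and the decay rate `lam`; the reviewed algorithms may use other forms.

```python
class FadingMicroCluster:
    """Hypothetical microcluster summary whose weight decays over time."""

    def __init__(self, dim, lam=0.5):
        self.lam = lam                 # decay rate: larger values forget faster
        self.weight = 0.0              # decayed count of absorbed points
        self.linear_sum = [0.0] * dim  # decayed per-dimension sum
        self.last_update = 0.0

    def _fade(self, now):
        # Scale the summary by 2^(-lam * elapsed) before using it.
        factor = 2.0 ** (-self.lam * (now - self.last_update))
        self.weight *= factor
        self.linear_sum = [s * factor for s in self.linear_sum]
        self.last_update = now

    def insert(self, point, now):
        self._fade(now)
        self.weight += 1.0
        self.linear_sum = [s + p for s, p in zip(self.linear_sum, point)]

    def center(self, now):
        # Assumes at least one point has been inserted (weight > 0).
        self._fade(now)
        return [s / self.weight for s in self.linear_sum]

mc = FadingMicroCluster(dim=2, lam=1.0)
mc.insert([0.0, 0.0], now=0.0)
mc.insert([4.0, 4.0], now=2.0)  # the first point now counts for 2^-2 = 0.25
print(mc.weight)
print(mc.center(now=2.0))
```

Because old points lose weight, the center drifts toward recent data, which is how a fading function lets a summary follow an evolving stream without an explicit change detector.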
Abstract: Generative Adversarial Networks (GANs) are neural networks that allow models to learn deep representations without requiring a large amount of training data. Semi-Supervised GAN Classifiers are a recent innovation in GANs, where GANs are used to classify generated images as real or fake and into multiple classes, similar to a general multi-class classifier. However, GANs have a sophisticated design that can be challenging to train, because obtaining the proper set of parameters for all models (generator, discriminator, and classifier) is complex. As a result, training a single GAN model for different datasets may not produce satisfactory results. Therefore, this study proposes an SGAN model (Semi-Supervised GAN Classifier). First, a baseline model was constructed. The model was then enhanced by leveraging the Sine Cosine Algorithm (SCA) and the Synthetic Minority Oversampling Technique (SMOTE). SMOTE was used to address class imbalances in the dataset, while SCA was used to optimize the weights of the classifier models. The optimal set of hyperparameters (learning rate and batch size) was obtained using a manual grid search. Four well-known benchmark datasets and a set of evaluation measures were used to validate the proposed model. The proposed method was then compared against existing models, and the results recorded on each dataset demonstrated its effectiveness. The proposed model showed improvements in test accuracy of 1%, 2%, 15%, and 5% on the benchmark multimedia datasets Modified National Institute of Standards and Technology (MNIST) digits, Fashion MNIST, Pneumonia Chest X-ray, and Facial Emotion Detection Dataset, respectively.
Funding: This research was fully supported by Universiti Teknologi PETRONAS, under the Yayasan Universiti Teknologi PETRONAS (YUTP) Fundamental Research Grant Scheme (015LC0-311).
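The SMOTE step named in the SGAN abstract above can be sketched as linear interpolation between a minority sample and one of its nearest minority neighbors. This is a simplified illustration, not the authors' pipeline: the function name `smote_sketch`, the neighbor count `k`, the fixed seed, and the toy points are all assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point (excluding itself).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Made-up minority-class samples, for illustration only.
minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 1.0]]
for point in smote_sketch(minority, n_new=3):
    print(point)
```

Each synthetic point lies on a segment between two real minority samples, so oversampling fills in the minority region rather than duplicating existing rows.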
Abstract: Diabetes mellitus is a long-term condition characterized by hyperglycemia, and it can lead to many complications. Given the rising morbidity of recent years, the number of diabetic patients worldwide is projected to exceed 642 million by 2040, implying that one in every ten people will be diabetic. This startling figure requires immediate attention from industry and academia to promote innovation and growth in diabetes risk prediction and so save lives. Owing to its rapid development, deep learning (DL) has been used to predict numerous diseases. However, DL methods still suffer from limited prediction performance because of hyperparameter selection and parameter optimization. The selection of hyperparameters is therefore critical to improving classification performance. This study presents a Convolutional Neural Network (CNN), an architecture that has achieved remarkable results in many medical domains, in which the Bayesian optimization algorithm (BOA) is employed for hyperparameter selection and parameter optimization. Two issues were investigated and solved during the experiment to enhance the results. The first is the class imbalance of the dataset, which is solved using the Synthetic Minority Oversampling Technique (SMOTE). The second is the model’s poor performance, which is solved using the Bayesian optimization algorithm. The findings indicate that the Bayesian-based CNN model surpasses all the state-of-the-art models in the literature with an accuracy of 89.36%, F1-score of 0.88.6, and Matthews Correlation Coefficient (MCC) of 0.88.6.
Funding: This ongoing research is supported by Yayasan UTP Grant (015LC0-321 & 015LC0-311) and Fundamental Research Grant Scheme (FRGS/1/2018/ICT02/UTP/02/1), a grant funded by the Ministry of Higher Education, Malaysia.
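The F1-score and Matthews Correlation Coefficient reported in the diabetes abstract above follow standard definitions over a binary confusion matrix, sketched below. The counts in the usage example are made-up illustration values, not results from the study, and the zero-denominator conventions are one common choice.

```python
import math

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient in [-1, 1]; 0.0 when any margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical confusion-matrix counts, for illustration only.
tp, tn, fp, fn = 40, 45, 5, 10
print(f1_score(tp, fp, fn))
print(mcc(tp, tn, fp, fn))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside F1 on imbalanced medical datasets.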
Abstract: Electricity price forecasting is a subset of energy and power forecasting that focuses on projecting present and future prices in the commercial electricity market. Electricity price forecasting has been a critical input to energy corporations’ strategic decision-making systems over the last 15 years. Many strategies have been utilized for price forecasting in the past; however, Artificial Intelligence techniques (Fuzzy Logic and ANN) have proven to be more efficient than traditional techniques (Regression and Time Series). Fuzzy logic is an approach that uses membership functions (MFs) and a fuzzy inference model to forecast future electricity prices. Fuzzy c-means (FCM) is one of the popular clustering approaches for generating fuzzy membership functions. However, the fuzzy c-means algorithm is limited to producing only one type of MF, the Gaussian MF. The generation of various fuzzy membership functions is critical since it allows for more efficient and optimal problem solutions. Improving electricity price forecasting therefore requires an approach that generates multiple types of type-1 fuzzy MFs using the FCM algorithm. Accordingly, the objective of this paper is to propose an approach for generating type-1 fuzzy triangular and trapezoidal MFs using the FCM algorithm to overcome this limitation. The approach is used to compute and improve forecasting accuracy for electricity prices, using Australian Energy Market Operator (AEMO) data. The results show that the proposed approach of using FCM to generate type-1 fuzzy MFs is effective and can be adopted.
Funding: Supported by Universiti Teknologi PETRONAS, under the Yayasan Universiti Teknologi PETRONAS (YUTP) Fundamental Research Grant Scheme (YUTPFRG/015LC0-274), and by Researchers Supporting Project Number (RSP-2023/309), King Saud University, Riyadh, Saudi Arabia.
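The triangular and trapezoidal membership functions named in the electricity price abstract above have standard piecewise-linear definitions, sketched below. How the paper derives their parameters from FCM is not stated in the abstract; `triangles_from_centers`, which places one triangle per cluster center with feet at the neighboring centers, is a hypothetical construction for illustration, and the center values are made up rather than taken from AEMO data.

```python
def triangular(x, a, b, c):
    """Triangular MF with feet at a and c and peak at b (a <= b <= c)."""
    if x == b:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if b < x < c:
        return (c - x) / (c - b)
    return 0.0

def trapezoidal(x, a, b, c, d):
    """Trapezoidal MF with feet at a and d and plateau on [b, c]."""
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0

def triangles_from_centers(centers, lo, hi):
    """Assumed construction: one triangle per center, feet at the neighboring centers."""
    pts = [lo] + sorted(centers) + [hi]
    return [(pts[i - 1], pts[i], pts[i + 1]) for i in range(1, len(pts) - 1)]

# Hypothetical 1-D cluster centers (e.g. price levels), not AEMO data.
params = triangles_from_centers([20.0, 50.0, 90.0], lo=0.0, hi=120.0)
print(params)
print([triangular(35.0, *p) for p in params])
```

With this construction, adjacent triangles overlap so that memberships between two centers sum to one, a property often wanted in fuzzy partitions.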
Abstract: The process of selecting features or reducing dimensionality can be viewed as a multi-objective minimization problem in which both the number of features and the error rate must be minimized. Although it is a multi-objective problem, current methods tend to treat feature selection as a single-objective optimization task. This paper presents an enhanced multi-objective grey wolf optimizer with Lévy flight and mutation phase (LMuMOGWO) for tackling feature selection problems. The proposed approach integrates two effective operators into the existing Multi-objective Grey Wolf Optimizer (MOGWO): a Lévy flight and a mutation operator. The Lévy flight, a type of random walk with jump size determined by the Lévy distribution, enhances the global search capability of MOGWO, with the objective of maximizing classification accuracy while minimizing the number of selected features. The mutation operator is integrated to add more informative features that can assist in enhancing classification accuracy. As feature selection is a binary problem, the continuous search space is converted into a binary space using the sigmoid function. To evaluate the classification performance of the selected feature subset, the proposed approach employs a wrapper-based Artificial Neural Network (ANN). The effectiveness of LMuMOGWO is validated on 12 conventional UCI benchmark datasets and compared with two existing variants of MOGWO, BMOGWO-S (sigmoid-based) and BMOGWO-V (tanh-based), as well as Non-dominated Sorting Genetic Algorithm II (NSGA-II) and Multi-objective Particle Swarm Optimization (BMOPSO). The results demonstrate that the proposed LMuMOGWO approach is capable of successfully evolving and improving a set of randomly generated solutions for a given optimization problem. Moreover, the proposed approach outperforms existing approaches in most cases in terms of classification error rate, feature reduction, and computational cost.
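Two ingredients of the LMuMOGWO abstract above, the Lévy flight step and the sigmoid conversion from a continuous position to a binary feature mask, can be sketched as follows. Mantegna's algorithm is a widely used way to draw Lévy-like steps and is an assumption here, as are the exponent `beta=1.5`, the step scale `0.01`, the stochastic thresholding rule, and all function names; the abstract does not give the paper's exact update equations.

```python
import math
import random

def levy_step(rng, beta=1.5):
    """One heavy-tailed step via Mantegna's algorithm (assumed, not from the abstract)."""
    sigma = (
        math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
        / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
    ) ** (1 / beta)
    u = rng.gauss(0.0, sigma)
    v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def binarize(position, rng):
    """Map a continuous position to a 0/1 mask: bit d is 1 with probability sigmoid(x_d)."""
    return [1 if rng.random() < sigmoid(x) else 0 for x in position]

rng = random.Random(42)
position = [0.3, -1.2, 2.0, 0.0]
# Perturb each dimension with a scaled Lévy step, then convert to a feature mask.
moved = [x + 0.01 * levy_step(rng) for x in position]
print(binarize(moved, rng))
```

Most Lévy steps are small, but occasional large jumps move a solution far from its current region, which is the mechanism the abstract credits for improved global search.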