Unstable angina (UA) is the most dangerous type of coronary heart disease (CHD) and causes growing mortality and morbidity worldwide. Identifying biomarkers for UA at the proteomic and metabolomic levels is a better avenue for understanding its underlying mechanism. Feature-selection-based data mining methods are well suited to identifying biomarkers of UA. In this study, we carried out a clinical epidemiology survey to collect plasma from UA in-patients and controls. Proteomics and metabolomics data were obtained via two-dimensional difference gel electrophoresis and gas chromatography techniques. We present a novel computational strategy to select as few biomarkers as possible for UA in the two groups of data. First, a decision tree was used to select biomarkers for UA, and 3-fold cross-validation was used to evaluate computational performance for the three methods. Alternatively, we combined the independent t-test and a classification-based data mining method with a backward elimination technique to select as few protein and metabolite biomarkers as possible with the best classification performance. With this method, we selected 6 proteins and 5 metabolites for UA. The novel method presented here provides better insight into the pathology of a disease.
Data mining in the educational field can be used to optimize teaching and learning performance among students. Recently developed machine learning (ML) and deep learning (DL) approaches can be used to mine the data effectively. This study proposes an Improved Sailfish Optimizer-based Feature Selection with Optimal Stacked Sparse Autoencoder (ISOFS-OSSAE) for data mining and pattern recognition in the educational sector. The proposed ISOFS-OSSAE model aims to mine educational data and derive decisions based on feature selection and classification. The ISOFS-OSSAE model involves the design of the ISOFS technique to choose an optimal subset of features. In addition, swallow swarm optimization (SSO) with the SSAE model is used to perform the classification. To showcase the enhanced outcomes of the ISOFS-OSSAE model, a wide range of experiments was carried out on a benchmark dataset from the University of California Irvine (UCI) Machine Learning Repository. The simulation results showed improved classification performance of the ISOFS-OSSAE model over recent state-of-the-art approaches in terms of different performance measures.
Data mining plays a crucial role in extracting meaningful knowledge from large-scale data repositories, such as data warehouses and databases. Association rule mining, a fundamental process in data mining, involves discovering correlations, patterns, and causal structures within datasets. In the healthcare domain, association rules offer valuable opportunities for building knowledge bases, enabling intelligent diagnoses, and extracting invaluable information rapidly. This paper presents a novel approach called the Machine Learning based Association Rule Mining and Classification for Healthcare Data Management System (MLARMC-HDMS). The MLARMC-HDMS technique integrates classification and association rule mining (ARM) processes. Initially, the chimp optimization algorithm-based feature selection (COAFS) technique is employed within MLARMC-HDMS to select relevant attributes. Inspired by the foraging behavior of chimpanzees, the COA algorithm mimics their search strategy for food. Subsequently, the classification process uses stochastic gradient descent with a multilayer perceptron (SGD-MLP) model, while the Apriori algorithm determines attribute relationships. We propose a COA-based feature selection approach for medical data classification using machine learning techniques. This approach involves selecting pertinent features from medical datasets through COA and training machine learning models on the reduced feature set. We evaluate the performance of our approach on various medical datasets with diverse machine learning classifiers. Experimental results demonstrate that the proposed approach surpasses alternative feature selection methods, achieving higher accuracy and precision in medical data classification tasks. The study showcases the effectiveness and efficiency of the COA-based feature selection approach in identifying relevant features, thereby enhancing the diagnosis and treatment of various diseases. To provide further validation, we conduct detailed experiments on a benchmark medical dataset, revealing the superiority of the MLARMC-HDMS model over other methods, with a maximum accuracy of 99.75%. This research therefore contributes to the advancement of feature selection techniques in medical data classification and highlights the potential for improving healthcare outcomes through accurate and efficient data analysis. The presented MLARMC-HDMS framework and COA-based feature selection approach offer valuable insights for researchers and practitioners working in healthcare data mining and machine learning.
Educational Data Mining (EDM) is an emergent discipline that concentrates on the design of self-learning and adaptive approaches. Higher education institutions have started to use analytical tools to improve students' grades and retention. Predicting students' performance is difficult owing to the massive quantity of educational data. Artificial Intelligence (AI) techniques can therefore be used for educational data mining in a big data environment. At the same time, feature selection becomes necessary in EDM for creating feature subsets. Since feature selection affects the predictive performance of any model, it is important to investigate carefully how the outcome of a student performance model depends on the feature selection technique. With this motivation, this paper presents a new Metaheuristic Optimization-based Feature Subset Selection with an Optimal Deep Learning model (MOFSS-ODL) for predicting students' performance. The proposed model uses an isolation forest-based outlier detection approach to eliminate outliers. The Chaotic Monarch Butterfly Optimization Algorithm (CBOA) is then used to select highly relevant features with low complexity and high performance. Next, a sailfish optimizer with stacked sparse autoencoder (SFO-SSAE) approach is used to classify the educational data. The MOFSS-ODL model is tested on a benchmark student performance dataset from the UCI repository. A wide-ranging simulation analysis showed improved predictive performance of the MOFSS-ODL technique over recent approaches in terms of different measures. Compared with other methods, the experimental results indicate that the proposed MOFSS-ODL classification model predicts students' academic progress well, with an accuracy of 96.49%.
Big data applications face different types of complexity in classification. Cleaning data by eliminating irrelevant or redundant records becomes a complex operation when discriminative features must be preserved in the processed data. Existing schemes have many disadvantages, including continuity in training, the need for more samples and training time in feature selection, and increased classification execution times. Recently, ensemble methods have made a mark in classification tasks because they combine multiple results into a single representation. Compared with a single model, this technique offers improved prediction. Ensemble-based feature selection parallels the judgments of multiple experts on a single topic. The major goal of this research is to propose HEFSM (Heterogeneous Ensemble Feature Selection Model), a hybrid approach that combines multiple algorithms. Further, the individual outputs of methods that produce feature subsets, rankings, or votes are also combined in this work. A KNN (K-Nearest Neighbor) classifier is used to classify the big dataset obtained from the ensemble learning approach. The results of the study are good, demonstrating the proposed model's efficiency in classification in terms of precision, recall, F-measure, and accuracy.
Data size plays a significant role in the design and performance of data mining models. A good feature selection algorithm reduces the problems of large data size and noise due to data redundancy. Feature selection algorithms aim to select the best features and eliminate unnecessary ones, which in turn simplifies the structure of the data mining model and increases its performance. This paper introduces a robust feature selection algorithm named the Features Ranking Voting (FRV) algorithm. It merges the benefits of different feature selection algorithms to determine the feature ranks in the dataset correctly and robustly, based on feature ranks and a voting algorithm. The FRV comprises three proposed techniques for selecting the minimum best feature set: the forward voting technique, which selects the highest-ranked features; the backward voting technique, which drops the low-ranked (low-importance) features; and a third technique that merges the outputs of the forward and backward techniques to maximize the robustness of the selected feature set. Different data mining models were built using the feature sets obtained by applying the proposed FRV to different datasets in order to evaluate its success. The high performance of these data mining models reflects the success of the proposed FRV algorithm. The FRV performance is compared with that of other feature selection algorithms. It succeeds in developing data mining models with an accuracy of 96.8% for the Hungarian CAD dataset and 96% for the Z-Alizadeh Sani CAD dataset, compared with 83.94% and 92.56%, respectively, in [48].
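The abstract above does not spell out the voting rule, so the following is only a minimal sketch of how forward voting, backward voting, and their merge might look. It assumes a strict-majority vote over per-algorithm rankings and assumes the merge is a set intersection; the function names, the parameters `k` and `m`, and the sample feature names are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def forward_vote(rankings, k):
    """Keep features that a strict majority of rankers place in their top k.

    ASSUMPTION: majority voting; the paper's actual rule is not given.
    """
    votes = Counter(f for r in rankings for f in r[:k])
    return {f for f, v in votes.items() if v > len(rankings) / 2}


def backward_vote(rankings, m):
    """Drop features that a strict majority of rankers place in their bottom m,
    and return the surviving features."""
    votes = Counter(f for r in rankings for f in r[len(r) - m:])
    dropped = {f for f, v in votes.items() if v > len(rankings) / 2}
    return set(rankings[0]) - dropped


def merged_vote(rankings, k, m):
    """ASSUMPTION: merge = features chosen by forward voting that also
    survive backward voting."""
    return forward_vote(rankings, k) & backward_vote(rankings, m)


# Hypothetical rankings (best first) from three feature selection algorithms.
rankings = [
    ["age", "chol", "bp", "sex", "fbs"],
    ["chol", "age", "sex", "bp", "fbs"],
    ["age", "bp", "chol", "fbs", "sex"],
]

selected = merged_vote(rankings, k=2, m=2)
print(sorted(selected))
```

Taking the intersection is the conservative choice: a feature is kept only if the rankers agree both that it is strong and that it is not weak.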
A cloud computing platform efficiently allocates dynamic resources, provisions computing and storage on demand according to user requests, and provides a good platform for big data feature analysis and mining. Big data feature mining in the cloud computing environment is an effective way to make efficient use of massive data in the information age. In big data mining, the feature mining method based on gradient sampling has poor logicality. It mines big data features only from a single-level perspective, which reduces the precision of big data feature mining.
To promote behavioral change among adolescents in Zambia, the National HIV/AIDS/STI/TB Council, in collaboration with UNICEF, developed the Zambia U-Report platform. This platform provides young people with improved access to information on various Sexual Reproductive Health topics through Short Messaging Service (SMS) messages. Over the years, the platform has accumulated millions of incoming and outgoing messages, which need to be categorized into key thematic areas for better tracking of sexual reproductive health knowledge gaps among young people. The current manual categorization of these text messages is inefficient and time-consuming, and this study aims to automate the process for improved analysis using text-mining techniques. First, the study investigates the current text message categorization process and identifies a list of categories adopted by counselors over time, which are then used to build and train a categorization model. Second, the study presents a proof-of-concept tool that automates the categorization of U-Report messages into key thematic areas using the developed categorization model. Finally, it compares the performance and effectiveness of the proof-of-concept tool against the manual system. The study used a dataset comprising 206,625 text messages. The current process would take roughly 2.82 years to categorize this dataset, whereas the trained SVM model would require only 6.4 minutes while achieving an accuracy of 70.4%, demonstrating that the automated method is significantly faster, more scalable, and more consistent than the current manual categorization. These advantages make the SVM model a more efficient and effective tool for categorizing large unstructured text datasets. These results and the proof-of-concept tool demonstrate the potential for enhancing the efficiency and accuracy of message categorization on the Zambia U-Report platform and other similar text-message-based platforms.
The relation between the mining pressure field, the fracture field, and gas emission at the working face is analyzed, and the concept that there is a stress point (or strain point) in the permeability of coal is presented. It is believed that the abrupt change in coal permeability caused by the sudden loading or unloading of the working face roof during periodic weighting is the main reason that a large amount of gas pours into the working face. Based on this concept, the relation among abutment pressure during periodic weighting, coal seam permeability, and gas emission is established, and a relation graph is drawn. The loading and unloading features of coal at the moment of fracture and non-fracture of the main roof are then revealed. Finally, it is shown that the process of sudden loading or unloading during periodic weighting plays an important role in the rupture propagation of coal, the analytical movement of gas, and gas emission.
Single-feature methods are unable to effectively track a target in an underground coal mine video due to the high background noise, low and uneven illumination, and drastic light fluctuation in the video. In this study, we propose an underground coal mine personnel target tracking method using multi-feature joint sparse representation. First, within a particle filter framework, the global and local multiple features of the target template and candidate particles are extracted. Second, each candidate particle is sparsely represented by a dictionary template, and reconstruction is achieved after solving for the sparse coefficient. Last, the particle with the lowest reconstruction error is taken as the tracking result. To validate the effectiveness of the proposed algorithm, we compare it with three commonly employed tracking algorithms. The results show that the proposed method is able to reliably track the target in various scenarios, such as occlusion and illumination change, which yields better tracking results and validates the feasibility and effectiveness of the proposed method.
Particle Swarm Optimization (PSO) is a popular bionic algorithm for optimization problems, based on the social behavior associated with bird flocking. To maintain the diversity of swarms, a few studies of multi-swarm strategies have been reported. However, the competition among swarms, that is, the reservation or destruction of a swarm, has not been considered further. In this paper, we formulate four rules by introducing a survival-of-the-fittest mechanism that simulates the competition among swarms. Based on this mechanism, we design a modified Multi-Swarm PSO (MSPSO) to solve discrete problems; it consists of a number of sub-swarms and a multi-swarm scheduler that monitors and controls each sub-swarm using the rules. To further address feature selection problems, we propose an Improved Feature Selection (IFS) method that integrates MSPSO and Support Vector Machines (SVM) with the F-score method. The IFS method aims to achieve higher generalization capability by performing kernel parameter optimization and feature selection simultaneously. The performance of the proposed method is compared with that of the standard PSO-based, Genetic Algorithm (GA)-based, and grid-search-based methods on 10 benchmark datasets taken from the UCI machine learning and StatLog databases. The numerical results and statistical analysis show that the proposed IFS method performs significantly better than the other three methods in terms of prediction accuracy, with a smaller subset of features.
This paper proposes a feature selection method based on Bayes' theorem. The purpose of the proposed method is to reduce computational complexity and increase the classification accuracy of the selected feature subsets. The dependence between two (binary) attributes is determined from the probabilities of their joint values that contribute to positive and negative classification decisions. If opposing sets of attribute values do not lead to opposing classification decisions (zero probability), the two attributes are considered independent of each other; otherwise they are considered dependent, and one of them can be removed, which reduces the number of attributes. The process must be repeated for all combinations of attributes. The paper also evaluates the approach by comparing it with existing feature selection algorithms on 8 datasets from the University of California, Irvine (UCI) machine learning databases. The proposed method shows better results than most existing algorithms in terms of the number of selected features, classification accuracy, and running time.
The rapid growth of social networks has produced an unprecedented amount of user-generated data, which provides an excellent opportunity for text mining. Authorship analysis, an important part of text mining, attempts to learn about the author of a text through subtle variations in writing style across gender, age, and social groups. Such information has a variety of applications, including advertising and law enforcement. One of the most accessible sources of user-generated data is Twitter, which makes the majority of its user data freely available through its data access API. In this study we seek to identify the gender of users on Twitter using Perceptron and Naive Bayes with selected 1- through 5-gram features from tweet text. Stream applications of these algorithms were employed for gender prediction to handle the speed and volume of tweet traffic. Because informal text, such as tweets, cannot be easily evaluated using traditional dictionary methods, n-gram features were used in this study to represent streaming tweets. The large number of 1- through 5-grams requires that only a subset of them be used in gender classification; for this reason, informative n-gram features were chosen using multiple selection algorithms. In the best case, the Naive Bayes and Perceptron algorithms produced accuracy, balanced accuracy, and F-measure above 99%.
It is known that the exploitation of opencast coal mines has seriously damaged the environment in semi-arid areas. Vegetation status can reliably reflect ecological degeneration and restoration in opencast mining areas in semi-arid regions. Long time series of MODIS NDVI data are widely used to simulate vegetation cover and so reflect the disturbance and restoration of local ecosystems. In this study, both qualitative (linear regression method and coefficient of variation (CoV)) and quantitative (spatial buffer analysis, and change amplitude and rate of change in the average NDVI) analyses were conducted to analyze the spatio-temporal dynamics of vegetation during 2000–2017 in Jungar Banner of Inner Mongolia Autonomous Region, China, at the large (Jungar Banner and three mine groups) and small (three types of functional areas: opencast coal mining excavation areas, reclamation areas, and natural areas) scales. The results show that the rates of change in the average NDVI in the reclamation areas (20%–60%) and opencast coal mining excavation areas (10%–20%) were considerably higher than that in the natural areas (<7%). The vegetation in the reclamation areas followed a trend of increase (3–5 years after reclamation), decrease (the sixth year of reclamation), and then stability. The vegetation in Jungar Banner is spatially heterogeneous under the influence of mining and reclamation activities. The ratio of vegetation improvement area to vegetation degradation area in the west, southwest, and east mine groups during 2000–2017 was 8:1, 20:1, and 33:1, respectively. The regions with a high CoV of NDVI (above 0.45) were mainly distributed around the opencast coal mining excavation areas, and the regions with a CoV of NDVI above 0.25 were mostly located in areas with low (28.8%) and medium-low (10.2%) vegetation cover. The average disturbance distances of mining activities on vegetation in the three mine groups (west, southwest, and east) were 800, 800, and 1000 m, respectively. The greater the scale of mining, the farther the disturbance distance of mining activities on vegetation. We conclude that vegetation reclamation will compensate for the negative impacts of opencast coal mining activities on vegetation. Sufficient attention should be paid to the proportional allocation of plant species (herbs and shrubs) in the reclamation areas, and the restored vegetation in these areas needs to be protected for more than 6 years. Then, as the repair time increases, the vegetation condition of the reclamation areas would exceed that of the natural areas.
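The two per-pixel statistics named in the abstract above, the linear regression trend and the coefficient of variation (CoV) of NDVI, can be sketched as follows. This is an illustrative sketch only: it assumes an ordinary least-squares slope against the year and a CoV defined as the population standard deviation divided by the mean, which the abstract does not confirm. The sample NDVI values are invented for illustration and are not data from the study.

```python
from statistics import mean, pstdev


def coefficient_of_variation(series):
    """CoV = standard deviation / mean.

    ASSUMPTION: population standard deviation; the study may use the
    sample standard deviation instead.
    """
    return pstdev(series) / mean(series)


def linear_trend(years, series):
    """Ordinary least-squares slope of NDVI against year (NDVI units per year)."""
    year_mean, value_mean = mean(years), mean(series)
    numerator = sum((y - year_mean) * (v - value_mean)
                    for y, v in zip(years, series))
    denominator = sum((y - year_mean) ** 2 for y in years)
    return numerator / denominator


# Hypothetical annual mean NDVI for a single pixel (not data from the study).
years = [2000, 2001, 2002, 2003, 2004, 2005]
ndvi = [0.30, 0.28, 0.25, 0.31, 0.36, 0.40]

print(f"trend per year: {linear_trend(years, ndvi):+.4f}")
print(f"CoV: {coefficient_of_variation(ndvi):.3f}")
```

A positive slope suggests vegetation improvement at that pixel, while a high CoV flags strong year-to-year fluctuation, the pattern the abstract reports around excavation areas.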
Mutual information is an important information measure for feature subsets. In this paper, a hashing mechanism is proposed to calculate the mutual information of a feature subset. The redundancy-synergy coefficient, a novel measure of the redundancy and synergy of features with respect to the class feature, is defined in terms of mutual information. The information maximization rule is applied to derive a heuristic feature subset selection method based on mutual information and the redundancy-synergy coefficient. Our experimental results show the good performance of the new feature selection method.
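As a rough illustration of the hashing idea mentioned above, the joint value of a feature subset can be used as a hash-table key, so that the mutual information between the subset and the class is computed in one pass over the data. This is a sketch under assumptions, not the paper's implementation: the function name and the toy XOR dataset are hypothetical, and the redundancy-synergy coefficient itself is not reproduced because the abstract does not define it.

```python
from collections import Counter
from math import log2


def mutual_information(rows, subset, labels):
    """I(X_S; Y) in bits for discrete data.

    Each row's values on the chosen feature indices are packed into a
    tuple, which serves as the hash key for counting joint occurrences.
    """
    n = len(rows)
    keys = [tuple(row[i] for i in subset) for row in rows]
    joint = Counter(zip(keys, labels))   # counts of (subset value, class)
    subset_counts = Counter(keys)        # counts of subset value
    label_counts = Counter(labels)       # counts of class
    return sum(
        count / n * log2(count * n / (subset_counts[x] * label_counts[y]))
        for (x, y), count in joint.items()
    )


# Toy XOR data: the class is the exclusive-or of the two features.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

print(mutual_information(rows, [0], labels))     # single feature
print(mutual_information(rows, [0, 1], labels))  # both features together
```

In the XOR example each feature alone carries no information about the class, while the pair determines it completely. This is the kind of synergy that a subset-level measure can capture and a per-feature score cannot.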
In pattern recognition and machine learning, features play a key role in prediction. Well-known applications of features include medical imaging and image classification, to name a few. With the exponential growth of investment in medical data repositories and health service provision, medical institutions are collecting large volumes of data. These data repositories contain detailed information that is essential for supporting medical diagnostic decisions and improving the quality of patient care. On the other hand, this growth has also made it difficult to comprehend and use the data for various purposes. The results of imaging data can become biased because of extraneous features present in larger datasets. Feature selection offers a way to decrease the number of features in such large datasets: selection techniques discard unimportant features and select a subset that yields higher classification accuracy. Choosing good attributes produces a precise classification model, which improves learning speed and predictive performance. This paper presents a review of feature selection techniques and attribute selection measures for medical imaging. The review describes feature selection techniques in the medical domain with their pros and cons and indicates their application to imaging data and data mining algorithms. It reveals the shortcomings of existing feature and attribute selection techniques when applied to multi-sourced data. Moreover, the review highlights the importance of feature selection for the correct classification of medical infections. Finally, a critical analysis and future directions are provided.
Spam is no longer just unsolicited commercial email that wastes our time; it also consumes network traffic and mail server storage. Furthermore, spam has become a major component of several attack vectors, including phishing, cross-site scripting, cross-site request forgery, and malware infection. Statistics show that the amount of spam containing malicious content has increased compared with spam advertising legitimate products and services. In this paper, spam detection is investigated with the aim of developing an efficient method to identify spam email based on analysis of the content of email messages. We identify a set of features that includes a considerable number of malicious-related features. Our goal is to study the effect of these features in helping classical classifiers identify spam emails. To make the problem more challenging, we developed spam classification models based on imbalanced data in which spam emails form the rare class, with only 16.5% of the total emails. Different metrics were used to evaluate the developed models. Results show a noticeable improvement in spam classification models when they are trained on a dataset that includes malicious-related features.
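The abstract notes that different metrics were used because spam is a rare class (16.5% of emails). The arithmetic below is a small worked example of why accuracy alone can mislead on such data. The corpus size of 1,000 emails and the trivial "always predict legitimate" baseline are assumptions made purely for illustration and do not come from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1) from confusion-matrix counts,
    treating spam as the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical corpus of 1,000 emails with 16.5% spam (165 spam, 835 legitimate).
# Baseline: a classifier that labels every email as legitimate.
baseline = classification_metrics(tp=0, fp=0, fn=165, tn=835)
print(baseline)
```

The baseline reaches 83.5% accuracy while detecting no spam at all (recall and F1 are zero), which is why recall, precision, and F-measure on the rare class are the more informative figures for imbalanced spam data.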
Student performance prediction helps educational stakeholders take proactive decisions and make interventions to improve the quality of education and meet the dynamic needs of society. The selection of features for student performance prediction not only plays a significant role in increasing prediction accuracy but also helps in building strategic plans for improving students' academic performance. There are different feature selection algorithms for predicting student performance; however, studies reported in the literature indicate that existing feature selection algorithms each have pros and cons in selecting optimal features. In this paper, a hybrid feature selection framework (using feature fusion) is designed to identify the significant features and the features associated with the target class in order to predict student performance. The main goal of the proposed hybrid feature selection is not only to improve prediction accuracy but also to identify optimal features for building productive strategies to improve students' academic performance. The key difference between the proposed hybrid feature selection framework and the existing hybrid feature selection framework is a two-level feature fusion technique that uses cosine-based fusion; according to results reported in the existing literature, cosine similarity is considered the best of the existing similarity measures. The proposed hybrid feature selection is validated on four benchmark datasets that vary in the number of features and the number of instances. The validation results confirm that the proposed hybrid feature selection framework performs better than the existing hybrid feature selection framework and existing feature selection algorithms in terms of accuracy, F-measure, recall, and precision. The results reported in this paper show that the proposed approach achieves more than 90% accuracy on a benchmark dataset, which is better than the results of the existing approach.
Funding: Supported by the National Basic Research Program of China (No. 2011CB505106), the National Natural Science Foundation of China (No. 30902020), the Foundation of National Department of Public Benefit Research of China (No. 200807007), the Creation Fund for Significant New Drugs of China (No. 2009ZX09502-018), and the Foundation of International Science and Technology Cooperation of China (No. 2008DFA30610).
Abstract: Unstable angina (UA) is the most dangerous type of coronary heart disease (CHD) and causes growing mortality and morbidity worldwide. Identifying biomarkers for UA at the proteomic and metabolomic levels is a better avenue for understanding its underlying mechanism. Feature-selection-based data mining methods are well suited to identifying biomarkers of UA. In this study, we carried out a clinical epidemiology survey to collect plasma from UA in-patients and controls. Proteomics and metabolomics data were obtained via two-dimensional difference gel electrophoresis and gas chromatography techniques. We present a novel computational strategy to select as few biomarkers as possible for UA in the two groups of data. First, a decision tree was used to select biomarkers for UA, and 3-fold cross-validation was used to evaluate computational performance for the three methods. Alternatively, we combined the independent t-test and a classification-based data mining method with a backward elimination technique to select as few protein and metabolite biomarkers as possible with the best classification performance. With this method, we selected 6 proteins and 5 metabolites for UA. The novel method presented here provides better insight into the pathology of a disease.
Abstract: Data mining in the educational field can be used to optimize teaching and learning performance among students. Recently developed machine learning (ML) and deep learning (DL) approaches can be utilized to mine the data effectively. This study proposes an Improved Sailfish Optimizer-based Feature Selection with Optimal Stacked Sparse Autoencoder (ISOFS-OSSAE) for data mining and pattern recognition in the educational sector. The proposed ISOFS-OSSAE model aims to mine educational data and derive decisions based on the feature selection and classification process. The ISOFS-OSSAE model involves the design of the ISOFS technique to choose an optimal subset of features. Moreover, swallow swarm optimization (SSO) with the SSAE model is used to perform the classification process. To showcase the enhanced outcomes of the ISOFS-OSSAE model, a wide range of experiments was conducted on a benchmark dataset from the University of California Irvine (UCI) Machine Learning Repository. The simulation results indicated the improved classification performance of the ISOFS-OSSAE model over recent state-of-the-art approaches in terms of different performance measures.
Funding: The Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia, funded this research work through Project Number RI-44-0444.
Abstract: Data mining plays a crucial role in extracting meaningful knowledge from large-scale data repositories, such as data warehouses and databases. Association rule mining, a fundamental process in data mining, involves discovering correlations, patterns, and causal structures within datasets. In the healthcare domain, association rules offer valuable opportunities for building knowledge bases, enabling intelligent diagnoses, and extracting invaluable information rapidly. This paper presents a novel approach called the Machine Learning based Association Rule Mining and Classification for Healthcare Data Management System (MLARMC-HDMS). The MLARMC-HDMS technique integrates classification and association rule mining (ARM) processes. Initially, the chimp optimization algorithm-based feature selection (COAFS) technique is employed within MLARMC-HDMS to select relevant attributes. Inspired by the foraging behavior of chimpanzees, the COA algorithm mimics their search strategy for food. Subsequently, the classification process utilizes stochastic gradient descent with a multilayer perceptron (SGD-MLP) model, while the Apriori algorithm determines attribute relationships. We propose a COA-based feature selection approach for medical data classification using machine learning techniques. This approach involves selecting pertinent features from medical datasets through COA and training machine learning models using the reduced feature set. We evaluate the performance of our approach on various medical datasets employing diverse machine learning classifiers. Experimental results demonstrate that our proposed approach surpasses alternative feature selection methods, achieving higher accuracy and precision rates in medical data classification tasks. The study showcases the effectiveness and efficiency of the COA-based feature selection approach in identifying relevant features, thereby enhancing the diagnosis and treatment of various diseases. To provide further validation, we conduct detailed experiments on a benchmark medical dataset, revealing the superiority of the MLARMC-HDMS model over other methods, with a maximum accuracy of 99.75%. Therefore, this research contributes to the advancement of feature selection techniques in medical data classification and highlights the potential for improving healthcare outcomes through accurate and efficient data analysis. The presented MLARMC-HDMS framework and COA-based feature selection approach offer valuable insights for researchers and practitioners working in the field of healthcare data mining and machine learning.
Abstract: Educational Data Mining (EDM) is an emergent discipline that concentrates on the design of self-learning and adaptive approaches. Higher education institutions have started to utilize analytical tools to improve students’ grades and retention. Prediction of students’ performance is a difficult process owing to the massive quantity of educational data. Therefore, Artificial Intelligence (AI) techniques can be used for educational data mining in a big data environment. At the same time, in EDM, the feature selection process becomes necessary for the creation of feature subsets. Since feature selection performance affects the predictive performance of any model, it is important to investigate in detail how the outcome of a students’ performance model relates to the feature selection technique. With this motivation, this paper presents a new Metaheuristic Optimization-based Feature Subset Selection with an Optimal Deep Learning model (MOFSS-ODL) for predicting students’ performance. In addition, the proposed model uses an isolation forest-based outlier detection approach to eliminate outliers. Besides, the Chaotic Monarch Butterfly Optimization Algorithm (CBOA) is used for the selection of highly related features with low complexity and high performance. Then, a sailfish optimizer with stacked sparse autoencoder (SFO-SSAE) approach is utilized for the classification of educational data. The MOFSS-ODL model is tested against a benchmark student performance data set from the UCI repository. A wide-ranging simulation analysis showed the improved predictive performance of the MOFSS-ODL technique over recent approaches in terms of different measures. Compared to other methods, the experimental results show that the proposed MOFSS-ODL classification model predicts students’ academic progress well, with an accuracy of 96.49%.
Abstract: Big data applications face different types of complexity in classification. Cleaning and purifying data by eliminating irrelevant or redundant data becomes a complex operation when attempting to maintain discriminative features in the processed data. Existing schemes have many disadvantages, including continuity in training, the need for more samples and training time in feature selection, and increased classification execution time. Recently, ensemble methods have made a mark in classification tasks because they combine multiple results into a single representation. Compared with a single model, this technique offers improved prediction. Ensemble-based feature selection parallels multiple experts’ judgments on a single topic. The major goal of this research is to propose HEFSM (Heterogeneous Ensemble Feature Selection Model), a hybrid approach that combines multiple algorithms. Further, the individual outputs of methods that produce feature subsets, rankings, or votes are also combined in this work. A KNN (K-Nearest Neighbor) classifier is used to classify the big dataset obtained from the ensemble learning approach. The results of the study are good, demonstrating the proposed model’s efficiency in classification in terms of precision, recall, F-measure, and accuracy.
Abstract: Data size plays a significant role in the design and performance of data mining models. A good feature selection algorithm reduces the problems of big data size and of noise due to data redundancy. Feature selection algorithms aim at selecting the best features and eliminating unnecessary ones, which in turn simplifies the structure of the data mining model and increases its performance. This paper introduces a robust feature selection algorithm named the Features Ranking Voting (FRV) algorithm. It merges the benefits of different feature selection algorithms to determine the feature ranks in a dataset correctly and robustly, based on feature ranks and a voting algorithm. FRV comprises three proposed techniques for selecting the minimum best feature set: the forward voting technique, which selects the best high-rank features; the backward voting technique, which drops the low-rank (low-importance) features; and a third technique, which merges the outputs of the forward and backward techniques to maximize the robustness of the selected feature set. To evaluate the proposed FRV, different data mining models were built using the feature sets obtained by applying it to different datasets. The high performance of these data mining models reflects the success of the proposed FRV algorithm. FRV's performance is compared with that of other feature selection algorithms. It succeeds in developing data mining models with an accuracy of 96.8% for the Hungarian CAD dataset and 96% for the Z-Alizadeh Sani CAD dataset, compared with 83.94% and 92.56%, respectively, in [48].
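The idea of merging several feature rankings by voting can be illustrated with a simple Borda-style aggregation. This sketch is not the FRV algorithm itself, whose forward and backward voting rules the abstract does not spell out; `vote_ranking` and the feature names are hypothetical, and every ranking is assumed to list the same features.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def vote_ranking(rankings: Sequence[Sequence[str]]) -> List[str]:
    """Order features by their summed rank positions (lower is better)."""
    totals: Dict[str, int] = defaultdict(int)
    for ranking in rankings:
        for position, feature in enumerate(ranking):
            totals[feature] += position
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(totals, key=lambda f: (totals[f], f))


# Three hypothetical selectors rank the same three features, best first.
rankings = [
    ["age", "chol", "bp"],
    ["chol", "age", "bp"],
    ["age", "bp", "chol"],
]
merged = vote_ranking(rankings)
print(merged)        # consensus order, best first
print(merged[:2])    # keeping the top of the list mimics a "forward" pick
```

Dropping features from the tail of `merged` instead would correspond loosely to a backward pass.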
Abstract: A cloud computing platform allocates dynamic resources efficiently, provisions computing and storage dynamically according to user requests, and provides a good platform for big data feature analysis and mining. Big data feature mining in the cloud computing environment is an effective method for the efficient application of massive data in the information age. In big data mining, the method of big data feature mining based on gradient sampling has poor logicality. It mines big data features only from a single-level perspective, which reduces the precision of big data feature mining.
Abstract: To promote behavioral change among adolescents in Zambia, the National HIV/AIDS/STI/TB Council, in collaboration with UNICEF, developed the Zambia U-Report platform. This platform provides young people with improved access to information on various Sexual Reproductive Health topics through Short Messaging Service (SMS) messages. Over the years, the platform has accumulated millions of incoming and outgoing messages, which need to be categorized into key thematic areas for better tracking of sexual reproductive health knowledge gaps among young people. The current manual categorization of these text messages is inefficient and time-consuming, and this study aims to automate the process for improved analysis using text-mining techniques. Firstly, the study investigates the current text message categorization process and identifies a list of categories adopted by counselors over time, which are then used to build and train a categorization model. Secondly, the study presents a proof-of-concept tool that automates the categorization of U-Report messages into key thematic areas using the developed categorization model. Finally, it compares the performance and effectiveness of the proof-of-concept tool against the manual system. The study used a dataset comprising 206,625 text messages. The current process would take roughly 2.82 years to categorise this dataset, whereas the trained SVM model would require only 6.4 minutes while achieving an accuracy of 70.4%, demonstrating that the automated method is significantly faster, more scalable, and more consistent than the current manual categorization. These advantages make the SVM model a more efficient and effective tool for categorizing large unstructured text datasets. These results and the proof-of-concept tool demonstrate the potential for enhancing the efficiency and accuracy of message categorization on the Zambia U-Report platform and other similar text-message-based platforms.
Funding: Natural Science Foundation of China (No. 50974054), Doctoral Program Foundation of the Ministry of Education (No. 20070460001), and National Key Basic Research and Development Program (No. 2012CB723103).
Abstract: The relation between the mining pressure field, the fracture field, and gas emission at the working face is analyzed, and the concept that there is a stress point (or strain point) in the permeability of coal is presented. It is believed that the mutation of coal permeability caused by the sudden loading or unloading of the working face roof as periodic weighting occurs is the main reason that a large amount of gas pours into the working face. Based on this concept, the relation among abutment pressure during periodic weighting, permeability of the coal seam, and gas emission is established, and a relation graph is drawn. Then the loading and unloading features of coal at the moment of fracture and non-fracture of the main roof are revealed. Finally, it is shown that the process of sudden loading or unloading as periodic weighting occurs plays an important role in rupture propagation of coal, analytical movement of gas, and gas emission.
Abstract: Single-feature methods are unable to effectively track a target in an underground coal mine video due to the high background noise, low and uneven illumination, and drastic light fluctuation in the video. In this study, we propose an underground coal mine personnel target tracking method using multi-feature joint sparse representation. First, within a particle filter framework, the global and local multiple features of the target template and candidate particles are extracted. Second, each of the candidate particles is sparsely represented by a dictionary template, and reconstruction is achieved after solving the sparse coefficient. Last, the particle with the lowest reconstruction error is deemed the tracking result. To validate the effectiveness of the proposed algorithm, we compare the proposed method with three commonly employed tracking algorithms. The results show that the proposed method is able to reliably track the target in various scenarios, such as occlusion and illumination change, and generates better tracking results, which validates its feasibility and effectiveness.
Funding: This work was supported by the National Natural Science Foundation of China (Grant no. 60971089), the National Electronic Development Foundation of China (Grant no. 2009537), and the Jilin Province Science and Technology Department Project of China (Grant no. 20090502).
Abstract: Particle Swarm Optimization (PSO) is a popular bionic algorithm for optimization problems, based on the social behavior associated with bird flocking. To maintain the diversity of swarms, a few studies of multi-swarm strategies have been reported. However, the competition among swarms, that is, the preservation or destruction of a swarm, has not been considered further. In this paper, we formulate four rules by introducing a survival-of-the-fittest mechanism, which simulates the competition among the swarms. Based on this mechanism, we design a modified Multi-Swarm PSO (MSPSO) to solve discrete problems, which consists of a number of sub-swarms and a multi-swarm scheduler that can monitor and control each sub-swarm using the rules. To further address feature selection problems, we propose an Improved Feature Selection (IFS) method by integrating MSPSO and Support Vector Machines (SVM) with the F-score method. The IFS method aims to achieve higher generalization capability by performing kernel parameter optimization and feature selection simultaneously. The performance of the proposed method is compared with that of the standard PSO-based, Genetic Algorithm (GA)-based, and grid-search-based methods on 10 benchmark datasets taken from the UCI machine learning and StatLog databases. The numerical results and statistical analysis show that the proposed IFS method performs significantly better than the other three methods in terms of prediction accuracy, with a smaller subset of features.
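For readers unfamiliar with the F-score used for feature ranking, the sketch below computes it for one feature in a two-class problem. It assumes the abstract refers to the common Fisher-style F-score (between-class separation divided by the sum of within-class sample variances); the function name `f_score` and the sample values are illustrative only.

```python
from statistics import mean, variance
from typing import Sequence


def f_score(pos: Sequence[float], neg: Sequence[float]) -> float:
    """F-score of a single feature given its values in each class.

    Each class needs at least two samples, and the feature must not be
    constant within both classes (otherwise the denominator is zero).
    """
    overall = mean(list(pos) + list(neg))
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1)
    return between / within


# A feature whose class means differ scores higher than one that does not.
informative = f_score([2.0, 4.0], [0.0, 2.0])
uninformative = f_score([1.0, 3.0], [1.0, 3.0])
print(informative, uninformative)
```

Features can then be sorted by this score, and a wrapper search such as PSO can choose how many of the top-ranked features to keep.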
Abstract: This paper proposes a method of feature selection using Bayes' theorem. The purpose of the proposed method is to reduce the computational complexity and increase the classification accuracy of the selected feature subsets. The dependence between two (binary) attributes is determined based on the probabilities of their joint values that contribute to positive and negative classification decisions. If opposing sets of attribute values do not lead to opposing classification decisions (zero probability), then the two attributes are considered independent of each other; otherwise they are considered dependent, and one of them can be removed, reducing the number of attributes. The process must be repeated on all combinations of attributes. The paper also evaluates the approach by comparing it with existing feature selection algorithms over 8 datasets from the University of California, Irvine (UCI) machine learning databases. The proposed method shows better results in terms of number of selected features, classification accuracy, and running time than most existing algorithms.
Abstract: The rapid growth of social networks has produced an unprecedented amount of user-generated data, which provides an excellent opportunity for text mining. Authorship analysis, an important part of text mining, attempts to learn about the author of a text through subtle variations in writing style that occur between gender, age, and social groups. Such information has a variety of applications, including advertising and law enforcement. One of the most accessible sources of user-generated data is Twitter, which makes the majority of its user data freely available through its data access API. In this study we seek to identify the gender of users on Twitter using Perceptron and Naive Bayes with selected 1- through 5-gram features from tweet text. Stream applications of these algorithms were employed for gender prediction to handle the speed and volume of tweet traffic. Because informal text, such as tweets, cannot be easily evaluated using traditional dictionary methods, n-gram features were implemented in this study to represent streaming tweets. The large number of 1- through 5-grams requires that only a subset of them be used in gender classification; for this reason, informative n-gram features were chosen using multiple selection algorithms. In the best case, the Naive Bayes and Perceptron algorithms produced accuracy, balanced accuracy, and F-measure above 99%.
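As a small illustration of the 1- through 5-gram features mentioned above, the following sketch extracts n-grams from a piece of text. It assumes word-level n-grams with whitespace tokenization and lowercasing, none of which the abstract specifies; `word_ngrams` is a hypothetical helper.

```python
from typing import List


def word_ngrams(text: str, n_min: int = 1, n_max: int = 5) -> List[str]:
    """Return all word n-grams of the text for n in [n_min, n_max]."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


features = word_ngrams("Going to the game", n_min=1, n_max=2)
print(features)
```

Because the number of distinct n-grams grows quickly with `n_max`, a selection step is needed before the features are passed to a classifier.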
Funding: Supported by the National Key Research and Development Program of China (2016YFC0501107), the Project of Ordos Science and Technology Program (2017006), and the Special Project of Science and Technology Basic Work of the Ministry of Science and Technology of China (2014FY110800).
Abstract: It is known that the exploitation of opencast coal mines has seriously damaged the environment in semi-arid areas. Vegetation status can reliably reflect ecological degeneration and restoration in opencast mining areas in semi-arid regions. Long time series of MODIS NDVI data are widely used to simulate vegetation cover and so reflect the disturbance and restoration of local ecosystems. In this study, both qualitative (linear regression method and coefficient of variation (CoV)) and quantitative (spatial buffer analysis, and change amplitude and rate of change in the average NDVI) analyses were conducted to analyze the spatio-temporal dynamics of vegetation during 2000–2017 in Jungar Banner of Inner Mongolia Autonomous Region, China, at the large scale (Jungar Banner and three mine groups) and the small scale (three types of functional areas: opencast coal mining excavation areas, reclamation areas, and natural areas). The results show that the rates of change in the average NDVI in the reclamation areas (20%–60%) and opencast coal mining excavation areas (10%–20%) were considerably higher than that in the natural areas (<7%). The vegetation in the reclamation areas experienced a trend of increase (3–5 years after reclamation), decrease (the sixth year of reclamation), and then stability. The vegetation in Jungar Banner shows spatial heterogeneity under the influence of mining and reclamation activities. The ratio of vegetation improvement area to vegetation degradation area in the west, southwest, and east mine groups during 2000–2017 was 8:1, 20:1, and 33:1, respectively. The regions with a high CoV of NDVI (above 0.45) were mainly distributed around the opencast coal mining excavation areas, and the regions with a CoV of NDVI above 0.25 were mostly located in areas with low (28.8%) and medium-low (10.2%) vegetation cover. The average disturbance distances of mining activities on vegetation in the three mine groups (west, southwest, and east) were 800, 800, and 1000 m, respectively. The greater the scale of mining, the farther the disturbance distance of mining activities on vegetation. We conclude that vegetation reclamation will certainly compensate for the negative impacts of opencast coal mining activities on vegetation. Sufficient attention should be paid to the proportional allocation of plant species (herbs and shrubs) in the reclamation areas, and the restored vegetation in these areas needs to be protected for more than 6 years. Then, as the repair time increases, the vegetation condition of the reclamation areas would exceed that of the natural areas.
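The two per-pixel statistics named in the abstract above, the linear regression trend and the coefficient of variation (CoV) of an NDVI time series, can be computed as in the sketch below. The NDVI values are made up for illustration, and the choice of the population standard deviation for the CoV is an assumption, since the abstract does not say which estimator was used.

```python
from statistics import mean, pstdev
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Standard deviation divided by the mean (population std assumed)."""
    return pstdev(values) / mean(values)


def linear_trend(values: Sequence[float]) -> float:
    """Least-squares slope of the values against their time index 0..n-1."""
    xs = list(range(len(values)))
    x_bar, y_bar = mean(xs), mean(values)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Hypothetical annual mean NDVI for one pixel over four years.
ndvi = [0.2, 0.3, 0.4, 0.5]
slope = linear_trend(ndvi)            # NDVI change per year
cov = coefficient_of_variation(ndvi)  # relative variability
print(round(slope, 3), round(cov, 3))
```

A positive slope indicates improving vegetation for that pixel, while a high CoV flags strong inter-annual fluctuation such as that reported around excavation areas.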
Funding: Project supported by the National Natural Science Foundation of China (No. 60075007) and the National Basic Research Program (973) of China (No. G1998030401).
Abstract: Mutual information is an important information measure for feature subsets. In this paper, a hashing mechanism is proposed to calculate the mutual information of a feature subset. The redundancy-synergy coefficient, a novel measure of the redundancy and synergy of features in expressing the class feature, is defined in terms of mutual information. The information maximization rule was applied to derive a heuristic feature subset selection method based on mutual information and the redundancy-synergy coefficient. Our experimental results showed the good performance of the new feature selection method.
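To make the central quantity concrete, the sketch below estimates the mutual information between a discrete feature subset and the class label from counts. It is not the paper's method: Python's built-in hashing of tuples in a `Counter` merely stands in for the hashing mechanism the abstract mentions, the redundancy-synergy coefficient is not reproduced, and the function names are hypothetical.

```python
from collections import Counter
from math import log2
from typing import Hashable, List, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """I(X; Y) in bits, estimated from paired discrete observations."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # p(x,y) * log2(p(x,y) / (p(x) p(y))) written with raw counts.
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def subset_mutual_information(
    rows: Sequence[Sequence[Hashable]], columns: List[int], ys: Sequence[Hashable]
) -> float:
    """Treat the chosen columns as one joint variable keyed by a tuple."""
    joint = [tuple(row[c] for c in columns) for row in rows]
    return mutual_information(joint, ys)


# Toy data: the label is the XOR of the two binary features.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]
single = subset_mutual_information(rows, [0], labels)
pair = subset_mutual_information(rows, [0, 1], labels)
print(single, pair)
```

The XOR example shows why subset-level measures matter: each feature alone carries no information about the label, yet the pair determines it completely, which is the kind of synergy a per-feature ranking would miss.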
Abstract: In the area of pattern recognition and machine learning, features play a key role in prediction. Well-known applications of features include medical imaging and image classification, to name a few. With the exponential growth of information investments in medical data repositories and health service provision, medical institutions are collecting large volumes of data. These data repositories contain detailed information essential to support medical diagnostic decisions and to improve the quality of patient care. On the other hand, this growth has also made it difficult to comprehend and utilize data for various purposes. The results of imaging data can become biased because of extraneous features present in larger datasets. Feature selection offers a way to decrease the number of components in such large datasets. Selection techniques remove the unimportant features and select a subset of components that yields better classification accuracy. Choosing good attributes produces a precise classification model, which improves learning speed and predictive performance. This paper presents a review of feature selection techniques and attribute selection measures for medical imaging. This review is meant to describe feature selection techniques in the medical domain with their pros and cons and to indicate their application to imaging data and data mining algorithms. The review reveals the shortcomings of the existing feature and attribute selection techniques when applied to multi-sourced data. Moreover, this review highlights the importance of feature selection for the correct classification of medical infections. Finally, a critical analysis and future directions are provided.
Abstract: Spam is no longer just unsolicited commercial email that wastes our time; it also consumes network traffic and mail servers’ storage. Furthermore, spam has become a major component of several attack vectors, including phishing, cross-site scripting, cross-site request forgery, and malware infection. Statistics show that the amount of spam containing malicious content has increased compared with spam advertising legitimate products and services. In this paper, the issue of spam detection is investigated with the aim of developing an efficient method to identify spam email based on analysis of the content of email messages. We identify a set of features that includes a considerable number of malicious-related features. Our goal is to study the effect of these features in helping classical classifiers identify spam emails. To make the problem more challenging, we developed spam classification models based on imbalanced data in which spam emails form the rare class, with only 16.5% of the total emails. Different metrics were utilized in the evaluation of the developed models. Results show a noticeable improvement in spam classification models when they are trained on a dataset that includes malicious-related features.
Funding: Supported by the National Natural Science Foundation of China (61139002, 61501229, 11547040) and the Guangdong Natural Science Foundation (2016A030310051).
Abstract: Student performance prediction helps educational stakeholders take proactive decisions and make interventions to improve the quality of education and to meet the dynamic needs of society. The selection of features for student performance prediction not only plays a significant role in increasing prediction accuracy, but also helps in building strategic plans for the improvement of students’ academic performance. There are different feature selection algorithms for predicting the performance of students; however, studies reported in the literature claim that existing feature selection algorithms have different pros and cons in selecting optimal features. In this paper, a hybrid feature selection framework (using feature fusion) is designed to identify the significant features and the features associated with the target class, in order to predict the performance of students. The main goal of the proposed hybrid feature selection is not only to improve prediction accuracy, but also to identify optimal features for building productive strategies for improving students’ academic performance. The key difference between the proposed hybrid feature selection framework and existing hybrid feature selection frameworks is a two-level feature fusion technique that utilizes cosine-based fusion. According to the results reported in the existing literature, cosine similarity is considered the best similarity measure among existing similarity measures. The proposed hybrid feature selection is validated on four benchmark datasets with variations in the number of features and the number of instances. The results confirm that the proposed hybrid feature selection framework performs better than the existing hybrid feature selection framework and existing feature selection algorithms in terms of accuracy, F-measure, recall, and precision. The results reported in the paper show that the proposed approach gives more than 90% accuracy on a benchmark dataset, which is better than the results of the existing approach.
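The cosine similarity measure on which the fusion step relies is easy to state in code. The sketch below shows only the measure itself; how the paper combines it across its two fusion levels is not described in the abstract, and the score vectors used here are hypothetical.

```python
from math import sqrt
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical importance scores assigned to the same four features
# by two different feature selection algorithms.
scores_filter = [0.9, 0.1, 0.4, 0.0]
scores_wrapper = [0.8, 0.2, 0.5, 0.1]
agreement = cosine_similarity(scores_filter, scores_wrapper)
print(round(agreement, 3))
```

A value close to 1 means the two selectors weight the features in nearly the same proportions, which is one plausible signal a fusion scheme could use when deciding which features to keep.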
Funding: Supported by the National Natural Science Foundation of China (60575036), the Natural Science Foundation of Heilongjiang Province of China (F0316), the Science and Technology Foundation for Innovative Talents of Harbin City of China (2007RFXXG023), and the Science Foundation for Top Talents with the Spirit of Innovation of Harbin University of Science and Technology.