Funding: Supported by Universiti Sains Malaysia (USM) and the School of Computer Sciences, USM.
Abstract: Feature selection is a crucial technique in text classification for improving the efficiency and effectiveness of classifiers or machine learning techniques by reducing the dataset's dimensionality. This involves eliminating irrelevant, redundant, and noisy features to streamline the classification process. Various methods, from single feature selection techniques to ensemble filter-wrapper methods, have been used in the literature. Metaheuristic algorithms have become popular due to their ability to handle optimization complexity and the continuous influx of text documents. Feature selection is inherently multi-objective, balancing the enhancement of feature relevance and accuracy against the reduction of redundant features. This research presents a two-fold objective for feature selection. The first objective is to identify the top-ranked features using an ensemble of three univariate filter methods: Information Gain (InfoGain), Chi-Square (χ²), and Analysis of Variance (ANOVA). This aims to maximize feature relevance while minimizing redundancy. The second objective involves reducing the number of selected features and increasing accuracy through a hybrid approach combining the Artificial Bee Colony (ABC) and Genetic Algorithm (GA). This hybrid method operates in a wrapper framework to identify the most informative subset of text features. A Support Vector Machine (SVM) was employed as the performance evaluator for the proposed model, tested on two high-dimensional multiclass datasets. The experimental results demonstrated that the ensemble filter combined with the ABC+GA hybrid approach is a promising solution for text feature selection, offering superior performance compared to other existing feature selection algorithms.
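As an illustration, the hedged sketch below assembles only the first (ensemble filter) stage with scikit-learn: each feature is ranked by Information Gain, Chi-Square, and ANOVA, and the three rankings are aggregated. The corpus, the 5000/500 feature counts, and the mean-rank aggregation rule are illustrative assumptions (the abstract does not specify how the ensemble combines the filters), and the ABC+GA wrapper stage is omitted.

```python
# Sketch of the ensemble-filter stage: rank every feature with three
# univariate filters, then aggregate the rankings by mean rank.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import mutual_info_classif, chi2, f_classif

train = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
X = TfidfVectorizer(max_features=5000).fit_transform(train.data)
y = train.target

def to_ranks(scores):
    # Rank 0 = highest-scoring (most relevant) feature.
    return np.argsort(np.argsort(-scores))

ig_scores = mutual_info_classif(X, y, discrete_features=True)  # ~ InfoGain
chi2_scores, _ = chi2(X, y)
anova_scores, _ = f_classif(X, y)

mean_rank = (to_ranks(ig_scores) + to_ranks(chi2_scores)
             + to_ranks(anova_scores)) / 3.0
top_features = np.argsort(mean_rank)[:500]  # candidates for the wrapper stage
```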
Funding: Project supported by the Joint Funds of the National Natural Science Foundation of China (No. U1509214), the Beijing Natural Science Foundation, China (No. 4174105), the Key Projects of the National Bureau of Statistics of China (No. 2017LZ05), and the Discipline Construction Foundation of the Central University of Finance and Economics, China (No. 2016XX02).
Abstract: Feature selection is an important approach to dimensionality reduction in the field of text classification. Because of the difficulty of handling the problem that the selected features always contain redundant information, we propose a new, simple feature selection method, which can effectively filter out redundant features. First, to calculate the relationship between two words, the definitions of word-frequency-based relevance and correlative redundancy are introduced. Furthermore, an optimal feature selection (OFS) method is chosen to obtain a feature subset FS1. Finally, to improve the execution speed, the redundant features in FS1 are filtered using a predetermined threshold, and the filtered features are memorized in linked lists. Experiments are carried out on three datasets (WebKB, 20-Newsgroups, and Reuters-21578), wherein support vector machines and naïve Bayes are used. The results show that the classification accuracy of the proposed method is generally higher than that of typical traditional methods (information gain, improved Gini index, and improved comprehensively measured feature selection) and the OFS methods. Moreover, the proposed method runs faster than typical mutual-information-based methods (improved and normalized mutual-information-based feature selections, and multilabel feature selection based on maximum dependency and minimum redundancy) while simultaneously ensuring classification accuracy. Statistical results validate the effectiveness of the proposed method in handling redundant information in text classification.
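The abstract does not reproduce the exact relevance and redundancy formulas, so the sketch below only illustrates the general greedy pattern: admit features in decreasing order of a relevance score and reject any candidate that is too similar to a feature already kept. The cosine-similarity proxy, the 0.8 threshold, and the target subset size are assumptions, as is the use of a plain Python list in place of the paper's linked lists.

```python
# Greedy redundancy filter: keep the most relevant features whose similarity
# to every already-kept feature stays below a preset threshold.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def filter_redundant(X, relevance, threshold=0.8, k=300):
    """X: (n_docs, n_features) matrix; relevance: per-feature scores."""
    kept = []
    for j in np.argsort(-relevance):            # most relevant first
        candidate = X[:, [j]].T                 # feature j as a row vector
        if all(cosine_similarity(candidate, X[:, [i]].T)[0, 0] < threshold
               for i in kept):
            kept.append(j)
        if len(kept) == k:
            break
    return kept
```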
Funding: Sponsored by the National Natural Science Foundation Projects (Grant Nos. 60773070, 60736044).
Abstract: To address the poor performance of text classification when using the traditional formula for mutual information (MI), a feature selection algorithm based on improved mutual information is proposed. The improved mutual information algorithm, which builds on traditional improved MI methods that enhance the MI value of negative features and account for feature frequency, introduces the concepts of concentration degree and dispersion degree. In accordance with these concepts, formulas embodying the concentration degree and dispersion degree were constructed, and the improved mutual information was implemented based on them. In this paper, the feature selection algorithm based on improved mutual information was applied to a text classifier based on Biomimetic Pattern Recognition and compared with several other feature selection methods. The experimental results showed that the improved mutual information feature selection method greatly enhances performance compared with traditional mutual information feature selection methods, and its performance is better than that of information gain. Through the introduction of the concepts of concentration degree and dispersion degree, the improved mutual information feature selection method greatly improves the performance of the text classification system.
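To make the idea concrete, the sketch below weights a classic pointwise-MI term by stand-in "concentration" and "dispersion" degrees. Both stand-in formulas are assumptions: the abstract describes the two concepts but not the exact equations the paper constructs.

```python
# Illustrative weighting of MI by assumed concentration/dispersion degrees.
import numpy as np

def improved_mi_score(df_tc, docs_per_class, n_docs):
    """df_tc[k]: number of documents in class k containing term t."""
    p_t = df_tc.sum() / n_docs                       # P(t) over the corpus
    best = -np.inf
    for k, df in enumerate(df_tc):
        p_t_given_c = df / docs_per_class[k]
        mi = np.log((p_t_given_c + 1e-12) / (p_t + 1e-12))  # classic MI term
        concentration = df / (df_tc.sum() + 1e-12)   # share of occurrences in class k
        dispersion = df / docs_per_class[k]          # coverage within class k
        best = max(best, mi * concentration * dispersion)
    return best

# Example: a term in 90 of 100 class-0 docs but only 5 of 100 class-1 docs
# scores highly because it is both concentrated and well dispersed in class 0.
print(improved_mi_score(np.array([90, 5]), np.array([100, 100]), 200))
```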
Abstract: Recent studies have revealed that emerging modern machine learning techniques, such as the SVM, are advantageous over statistical models for text classification. In this study, we discuss the application of the support vector machine with a mixture of kernels (SVM-MK) to the design of a text classification system. Differing from the standard SVM, the SVM-MK uses a 1-norm based objective function and adopts convex combinations of single-feature basic kernels. Only a linear programming problem needs to be solved, which greatly reduces the computational cost. More importantly, it is a transparent model, and the optimal feature subset can be obtained automatically. A real Chinese corpus from Fudan University is used to demonstrate the good performance of the SVM-MK.
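The hedged sketch below shows the mixture-of-kernels idea: the composite Gram matrix is a convex combination of basic kernels, each computed on a single feature. The RBF basic kernel, the fixed uniform weights, and the toy data are assumptions; in SVM-MK the combination weights and the 1-norm SVM coefficients are obtained jointly by solving a linear program, which is not reproduced here.

```python
# Composite kernel as a convex combination of single-feature basic kernels.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def mixture_kernel(X1, X2, weights, gamma=1.0):
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for j, w in enumerate(weights):              # weights >= 0, sum to 1
        K += w * rbf_kernel(X1[:, [j]], X2[:, [j]], gamma=gamma)
    return K

# Usage with a precomputed kernel; a sparse weight vector would directly
# expose the selected feature subset, which is what makes the model transparent.
X = np.random.rand(20, 5); y = np.random.randint(0, 2, 20)
w = np.full(X.shape[1], 1.0 / X.shape[1])
clf = SVC(kernel="precomputed").fit(mixture_kernel(X, X, w), y)
```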
Funding: Supported by the China Postdoctoral Science Foundation.
Abstract: We explore techniques for utilizing N-gram information to categorize Chinese text documents hierarchically, so that the classifier can shake off the burden of large dictionaries and complex segmentation processing and subsequently be domain- and time-independent. A hierarchical Chinese text classifier is implemented. Experimental results show that hierarchically classifying Chinese text documents based on N-grams can achieve satisfactory performance and outperforms other traditional Chinese text classifiers.
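The minimal sketch below illustrates the segmentation-free idea behind this approach: character N-grams replace a word dictionary, so no Chinese word segmentation is required. A flat classifier on a toy two-document corpus stands in for the paper's hierarchical classifier; the vectorizer settings and the naïve Bayes choice are assumptions.

```python
# Character N-gram text classification without word segmentation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 2)),  # character uni/bi-grams
    MultinomialNB(),
)
clf.fit(["球队赢得了比赛", "股市今天上涨"], ["sports", "finance"])
print(clf.predict(["比赛结果已经公布"]))  # expected: ['sports']
```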