Discrete numeralization of character variables destroys their unordered nature. Because text classification problems contain many character variables, the discrete numeralization approach degrades the classification performance of classifiers. In this paper, we propose a character variable numeralization algorithm based on dimension expansion. The algorithm first counts the number m of distinct values that the character variable takes. It then replaces each original value with one of the natural basis vectors of m-dimensional Euclidean space. Although the algorithm expands the dimension, it preserves the unordered nature of character variables because the natural basis vectors do not differ in size, which makes it a better numeralization algorithm for character variables. Experiments on text classification data sets show that the proposed algorithm costs slightly more running time but achieves better classification performance.
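The two steps described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `expand_character_variable`, the sorted ordering of the distinct values, and the sample data are all assumptions introduced here.

```python
def expand_character_variable(values):
    """Replace each value of a character variable with a natural basis vector.

    Hypothetical helper for illustration; the sorted category order is an
    arbitrary choice made here, not something the paper specifies.
    """
    # Step 1: find the m distinct values that the variable takes.
    categories = sorted(set(values))
    m = len(categories)
    position = {category: i for i, category in enumerate(categories)}

    # Step 2: map each value to the basis vector e_i of m-dimensional space.
    encoded = []
    for value in values:
        vector = [0] * m
        vector[position[value]] = 1
        encoded.append(vector)
    return categories, encoded


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Made-up sample data: one character variable with three distinct values.
colors = ["red", "green", "blue", "green"]
categories, encoded = expand_character_variable(colors)

for value, vector in zip(colors, encoded):
    print(value, vector)
```

Every encoded vector has norm 1, and any two distinct basis vectors are the same distance apart, so no value is placed "between" or "above" the others. An integer coding such as blue = 0, green = 1, red = 2 would instead make blue and red twice as far apart as blue and green. The cost is that one column becomes m columns.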
Funding: This work is sponsored by the National Natural Science Foundation of China (Nos. 61402246, 61402126, 61370083, 61370086, 61303193, and 61572268), a Project of Shandong Province Higher Educational Science and Technology Program (No. J15LN38), Qingdao indigenous innovation program (No. 15-9-1-47-jch), the National Research Foundation for the Doctoral Program of Higher Education of China (No. 20122304110012), the Natural Science Foundation of Heilongjiang Province of China (No. F201101), the Science and Technology Research Project Foundation of Heilongjiang Province Education Department (No. 12531105), Heilongjiang Province Postdoctoral Research Start Foundation (No. LBH-Q13092), and the National Key Technology R&D Program of the Ministry of Science and Technology under Grant No. 2012BAH81F02.