Abstract
This paper proposes a fast method for text tendency classification. A class space model describes the tendency of each word toward each category, and classification is carried out on the statistical features of words. To handle the complexity of tendency classification, the method jointly considers three statistical features (term frequency, document frequency, and the distribution of a word across categories) and introduces a new two-pass ("twice") feature selection method. The first pass applies a combined feature selection method to remove low-frequency words and noise words that are distributed uniformly across all categories. The second pass removes words whose category tendency is not pronounced. Experiments show that the method achieves high classification performance and runs fast, giving it practical value in information retrieval, information filtering, and content security management.
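The abstract gives no formulas, so the following is only a minimal sketch of how such a pipeline might look, not the authors' implementation. The function names `train` and `classify`, the thresholds `min_tf`, `min_df`, `max_uniformity`, and `min_tendency`, the min/max ratio used as a uniformity measure, and the additive scoring rule are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train(docs, min_tf=2, min_df=2, max_uniformity=0.9, min_tendency=0.6):
    """Build per-word class weights from (tokens, label) pairs.

    All thresholds are hypothetical; the abstract does not specify them.
    """
    labels = sorted({label for _, label in docs})
    tf = defaultdict(Counter)  # tf[word][label]: occurrences of word in that class
    df = Counter()             # df[word]: number of documents containing word
    class_size = Counter()     # class_size[label]: total tokens in that class
    for tokens, label in docs:
        class_size[label] += len(tokens)
        for w in tokens:
            tf[w][label] += 1
        for w in set(tokens):
            df[w] += 1

    weights = {}
    for w, counts in tf.items():
        # Pass 1a: drop low-frequency words (term frequency, document frequency).
        if sum(counts.values()) < min_tf or df[w] < min_df:
            continue
        # Share of the word's class-normalized frequency that falls in each class.
        rel = {c: counts[c] / max(class_size[c], 1) for c in labels}
        total = sum(rel.values())
        share = {c: rel[c] / total for c in labels}
        # Pass 1b: drop noise words spread almost evenly over all classes
        # (assumed measure: min share / max share, which is 1.0 when uniform).
        if min(share.values()) / max(share.values()) > max_uniformity:
            continue
        # Pass 2: drop words whose strongest class tendency is still weak.
        if max(share.values()) < min_tendency:
            continue
        weights[w] = share  # the word's coordinates in the class space
    return labels, weights


def classify(tokens, labels, weights):
    """Sum each known word's class weights and return the best-scoring class."""
    scores = {c: 0.0 for c in labels}
    for w in tokens:
        for c, v in weights.get(w, {}).items():
            scores[c] += v
    return max(labels, key=lambda c: scores[c])
```

Because classification in this sketch reduces to dictionary lookups and additions over the retained words, its cost grows linearly with document length, which is in keeping with the speed the abstract claims. With only two classes the uniformity filter and the tendency filter overlap heavily; they diverge once there are three or more classes.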
Source
Journal of University of Electronic Science and Technology of China (《电子科技大学学报》)
EI
CAS
CSCD
PKU Core (北大核心)
2007, No. 6, pp. 1232-1236 (5 pages)
Funding
National 863 Program of China (2005AA147030)
Keywords
category weight
class space model
text tendency categorization
twice feature selection