Abstract
Feature dimensionality reduction is one of the main problems in text classification. The authors first select feature terms using the chi-square (χ²) statistic, and then cluster the selected terms with an improved density-based clustering method. Drawing on class distribution information, the feature dimension of the texts is thus reduced in two successive steps while keeping information loss as small as possible. Each text is then represented as a matrix under a model based on class probability distributions, and classification is carried out using matrix theory (matrix norms). The experimental results show that the method classifies texts with relatively high precision.
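The first step described above, chi-square feature selection, can be illustrated with a minimal sketch. It uses the standard 2x2 contingency form of the statistic for a term and a class. The function names, the tokenized-document input format, and the choice to rank each term by its maximum score over all classes are assumptions made for illustration; the abstract does not state how the authors combine per-class scores.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: term or class carries no information
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k top-scoring terms and the full score table.

    docs is a list of token lists, labels the matching class labels.
    Assumption: a term's score is its maximum chi-square over all classes.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    term_df = Counter()        # document frequency of each term
    term_class_df = Counter()  # document frequency of each (term, class) pair
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_df[term] += 1
            term_class_df[(term, label)] += 1

    scores = {}
    for term, df in term_df.items():
        best = 0.0
        for cls, size in class_sizes.items():
            a = term_class_df[(term, cls)]
            b = df - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k], scores
```

Calling `select_features(docs, labels, k)` on a small labeled corpus keeps the `k` terms whose presence is most strongly associated with some class. The clustering and matrix-based classification steps are not sketched here because the abstract gives no detail on them.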
Source
《图书情报工作》
CSSCI
Peking University Core Journal
2008, No. 1, pp. 73-76 (4 pages)
Library and Information Service
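Funding
One of the research outcomes of the National Natural Science Foundation of China project "Interactive Group Decision-Making Theory Based on Incomplete Information and Its Applications" (Project No. 70571087)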
Keywords
text categorization
feature selection
feature clustering
Bayes distribution
text representation