Abstract
After introducing the text categorization problem, its key techniques, and the system architecture, this paper describes in detail a method for automatic Chinese text categorization that uses a BP (backpropagation) neural network with a momentum term as the classifier. The method computes feature weights with a normalized TF-IDF formula and reduces the dimensionality of the feature set with the expected cross entropy statistic. On the TanCorp12 data set, we use macro-averaged F1 and micro-averaged F1 as evaluation criteria to test how the number of feature terms (input nodes) and the number of training iterations affect the performance of the classifier.
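The three computations named in the abstract can be sketched briefly. The abstract does not give the paper's exact formulas, so everything below is an assumption: the weight is taken to be tf * log(N/df) followed by cosine normalization, the momentum update is the textbook form, and macro-averaged F1 is taken as the mean of per-category F1 scores. The function names `normalized_tfidf`, `momentum_step`, and `macro_micro_f1` are hypothetical and not from the paper.

```python
import math
from collections import Counter


def normalized_tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed form: tf * log(N / df), then cosine (L2) normalization.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # An all-zero vector is left as zeros instead of dividing by zero.
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors


def momentum_step(weight, grad, prev_delta, lr=0.1, momentum=0.9):
    """One weight update with a momentum term (textbook form, assumed).

    delta = -lr * grad + momentum * prev_delta
    """
    delta = -lr * grad + momentum * prev_delta
    return weight + delta, delta


def macro_micro_f1(counts):
    """counts: {category: (tp, fp, fn)}. Returns (macro_f1, micro_f1)."""
    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    # Macro: average the per-category F1 scores (each category counts equally).
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts over all categories, then compute a single F1.
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    return macro, f1(tp, fp, fn)
```

Macro averaging gives every category equal weight, while micro averaging is dominated by the large categories, which is why reporting both is informative on a corpus with unevenly sized categories.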
Source
Microcomputer Applications (《微计算机应用》)
2008, No. 3, pp. 31-36 (6 pages)
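Funding
Major Program of the National Natural Science Foundation of China (No. 60496322)
Beijing Municipal Organization Department Outstanding Talent Program (No. 2005D0501508)
Beijing University of Technology Youth Fund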
Keywords
text categorization, BP neural network, feature reduction