Abstract
As a simple, effective, nonparametric classification method, the k-Nearest Neighbor (k-NN) algorithm is widely used in text classification, but it is computationally expensive. This paper proposes a new fast text classification approach to address this weakness. The method trains on the original sample set to generate representative samples, and then repeatedly adjusts those representatives according to the distribution of the original training samples relative to the generated ones, so that the representatives become more representative of the data. This approach effectively compresses the original training corpus, substantially improving classification efficiency; at the same time, because the representative samples are distributed more reasonably, classification accuracy can also be improved. Experimental results show that the approach achieves good classification performance.
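The abstract only summarizes the procedure (generate representatives, then iteratively adjust them toward the distribution of the original samples, then classify with k-NN over the compressed set). As a rough sketch under assumed details not given in the abstract (Euclidean distance, a class-wise k-means-style refinement, and the hypothetical function names `build_representatives` / `knn_predict`), the idea might look like:

```python
import numpy as np

def build_representatives(X, y, per_class=3, iters=10, rng=None):
    """Condense a training set into a few representatives per class, then
    refine each representative toward the centroid of the original samples
    it attracts (assumed k-means-style adjustment; details are illustrative)."""
    rng = np.random.default_rng(rng)
    reps, labels = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        k = min(per_class, len(Xc))
        # initial representatives: random samples drawn from this class
        R = Xc[rng.choice(len(Xc), size=k, replace=False)].astype(float)
        for _ in range(iters):
            # assign each original sample to its nearest representative
            d = np.linalg.norm(Xc[:, None, :] - R[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            # move each representative to the mean of its assigned samples
            for j in range(k):
                members = Xc[nearest == j]
                if len(members):
                    R[j] = members.mean(axis=0)
        reps.append(R)
        labels.extend([c] * k)
    return np.vstack(reps), np.array(labels)

def knn_predict(x, reps, rep_labels, k=1):
    """k-NN vote over the compressed representative set instead of the
    full corpus, which is where the speedup comes from."""
    d = np.linalg.norm(reps - x, axis=1)
    idx = np.argsort(d)[:k]
    vals, counts = np.unique(rep_labels[idx], return_counts=True)
    return vals[counts.argmax()]
```

With, say, 3 representatives per class, a query is compared against only `3 * num_classes` vectors rather than every training document, which is the compression effect the abstract describes.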
Source
《计算机仿真》 (Computer Simulation), CSCD, 2007, No. 6, pp. 322-325 (4 pages)
Funding
National Natural Science Foundation of China (60204009)
Open Fund of the Key Laboratory of Complex Systems and Intelligence Science, Chinese Academy of Sciences (20040104)
National 973 Program (2004CB318109)
Keywords
Text classification
Representative samples
Fast classification