Abstract
With the rapid growth of document information on the Internet, text classification has become a key technology for processing and organizing large volumes of document data. The χ² statistic is widely used as an evaluation function in feature selection because it effectively measures the lack of independence between a term and a class. This paper analyzes the properties of the χ² statistic in both the feature selection stage and the classification decision stage, proposes a new similarity definition based on the χ² statistic, and combines it with a hybrid classification mechanism, a fast search algorithm based on two rounds of class determination, to improve the traditional kNN algorithm. Experiments show that the improved χ²-based kNN text classifier greatly reduces classification time and improves precision and recall compared with traditional kNN. Its performance is higher than that of traditional kNN and comparable with that of SVMTorch.
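As a point of reference for the χ² statistic mentioned above, the following is a minimal sketch of the standard term-class χ² score commonly used in feature selection, computed from a 2x2 contingency table of document counts. It is not the paper's new similarity definition, which the abstract does not specify. The function names `chi_square` and `term_class_counts`, the toy corpus, and the choice to return 0.0 for a degenerate table are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Standard chi-square score for one (term, class) pair.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column); treating the score
        # as 0.0 is an assumption of this sketch.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def term_class_counts(
    docs: List[Set[str]], labels: List[str], term: str, cls: str
) -> Tuple[int, int, int, int]:
    """Count the four cells (a, b, c, d) for a term and a class."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == cls
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Hypothetical toy corpus: each document is a set of tokens.
docs = [{"ball", "goal"}, {"ball", "team"}, {"vote", "law"}, {"vote", "ball"}]
labels = ["sport", "sport", "politics", "politics"]

score_ball = chi_square(*term_class_counts(docs, labels, "ball", "sport"))
score_vote = chi_square(*term_class_counts(docs, labels, "vote", "sport"))
```

A score of zero means the term and the class are independent in the sample. Because the difference is squared, a high score signals strong dependence in either direction: a term that is strongly indicative of the class and a term that is strongly indicative of its absence both score high.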
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
PKU Core Journal (北大核心)
2007, No. 6, pp. 1094-1097 (4 pages)
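Funding
Supported by the National Natural Science Foundation of China (60573097)
Supported by the Natural Science Foundation of Guangdong Province (05200302, 06104916)
Supported by a national science and technology program (2004BA721A02)
Supported by the Guangdong Province Science and Technology Plan Project (2005B10101032)
Supported by the Specialized Research Fund for the Doctoral Program of Higher Education (20050558017)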