
A kNN Text Categorization Algorithm Based on the χ² Statistic (cited by: 13)
Abstract: With the rapid growth of document information on the Internet, text classification has become a key technology for processing and organizing large volumes of document data. Because the χ² statistic captures the correlation (lack of independence) between a term and a class well, it is a widely used evaluation function in feature selection. This paper analyzes the properties of the χ² statistic in the feature selection and classification decision stages, proposes a new similarity definition based on the χ² statistic, and combines it with a fast search algorithm based on two rounds of class determination (a hybrid classification mechanism) to improve the traditional kNN algorithm. Experiments show that the improved kNN text classification algorithm greatly reduces classification (test) time and improves precision and recall compared with traditional kNN, and that its performance is comparable with SVMTorch.
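Authors: Yin Jian (印鉴), Tan Huanyun (谭焕云)
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2007, No. 6, pp. 1094-1097 (4 pages)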
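Funding: Supported by the National Natural Science Foundation of China (60573097); the Natural Science Foundation of Guangdong Province (05200302, 06104916); a national science and technology program (2004BA721A02); the Guangdong Provincial Science and Technology Plan (2005B10101032); and the Specialized Research Fund for the Doctoral Program of Higher Education (20050558017).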
Keywords: text categorization; feature selection; kNN; χ² statistic
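The abstract relies on the χ² statistic as a measure of how strongly a term depends on a class. As a rough illustration only, the sketch below computes the standard textbook χ² score for a term and a class from a 2x2 document-count table. It does not reproduce the paper's new similarity definition or its two-round fast search, which this record does not spell out. The function names `chi_square` and `chi_square_from_docs` and the toy documents are assumptions made for this example.

```python
from typing import Iterable, Sequence


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Standard chi-square score for one term and one class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no evidence.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def chi_square_from_docs(
    docs: Sequence[Iterable[str]], labels: Sequence[str], term: str, cls: str
) -> float:
    """Build the 2x2 table from tokenized documents, then score it."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in set(tokens)
        in_class = label == cls
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    return chi_square(a, b, c, d)


# Hypothetical toy corpus, used only to exercise the functions.
toy_docs = [["goal", "match"], ["goal", "team"], ["stock", "market"], ["bank", "stock"]]
toy_labels = ["sports", "sports", "finance", "finance"]
score = chi_square_from_docs(toy_docs, toy_labels, "goal", "sports")
```

In this toy corpus "goal" appears in every sports document and in no finance document, so the score reaches its maximum for the corpus size. A term spread evenly across classes scores zero, which is why low-scoring terms are usually the ones dropped during feature selection.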
