
A Strategy to Class Imbalance Problem for kNN Text Classifier (cited by 33)
Abstract: Class imbalance is one of the problems plaguing practitioners in the data mining community, and a variety of strategies have been proposed to deal with it. When the training set is skewed, the popular kNN text classifier will mislabel instances from rare categories as belonging to common ones, degrading the macro-F1 measure. To alleviate this defect, a novel concept, the critical point (CP) of the text training set, is proposed and its properties are explored; algorithms are given for computing CP and its lower approximation (LA) and upper approximation (UA). The traditional kNN decision function is then modified by integrating LA or UA together with the per-class training-set sizes, yielding a self-adaptive kNN text classifier with weight adjustment. To verify its effectiveness, two groups of experiments are carried out. The first group compares performance across different shrink factors, which can be viewed as a comparison with Tan's work, and confirms that at LA or UA the classifier exhibits better macro-F1. The second group compares against random re-sampling, with traditional kNN as the baseline. Experiments on four corpora show that the self-adaptive weight-adjusted kNN text classifier outperforms random re-sampling, improving macro-F1 markedly. The proposed method is, to some extent, similar to cost-sensitive learning.
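The decision-function adjustment described in the abstract can be illustrated with a minimal sketch. This is not the paper's exact LA/UA-based formula; it follows the general shrink-factor idea of down-weighting votes from large classes (function and variable names below are hypothetical):

```python
from collections import defaultdict

def weighted_knn_vote(neighbor_labels, class_sizes, shrink=0.5):
    """Aggregate the k nearest-neighbor votes, down-weighting large classes.

    neighbor_labels: labels of the k nearest training documents.
    class_sizes:     dict mapping each class to its training-set size.
    shrink:          shrink factor in [0, 1]; 0 reduces to plain majority
                     voting, larger values penalize big classes more
                     strongly (a stand-in for the paper's LA/UA-based
                     adjustment).
    """
    smallest = min(class_sizes.values())
    scores = defaultdict(float)
    for label in neighbor_labels:
        # Each vote is scaled by (n_min / n_class) ** shrink, so a class
        # with many training documents needs proportionally more neighbors
        # among the top k in order to win.
        scores[label] += (smallest / class_sizes[label]) ** shrink
    return max(scores, key=scores.get)

# Toy skewed training set: 90 "common" documents vs 10 "rare" documents.
sizes = {"common": 90, "rare": 10}
# 3 of the 5 nearest neighbors come from the big class.
neighbors = ["common", "common", "common", "rare", "rare"]
print(weighted_knn_vote(neighbors, sizes, shrink=0.0))  # prints "common"
print(weighted_knn_vote(neighbors, sizes, shrink=1.0))  # prints "rare"
```

With shrink=0 every vote counts equally and the majority class wins; with shrink=1 each "common" vote is worth 10/90, so the two "rare" neighbors outweigh the three "common" ones. This mirrors the effect the paper seeks: recovering minority-class instances that plain kNN would mislabel.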
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed by EI and CSCD (Peking University core journal), 2009, No. 1, pp. 52-61 (10 pages).
Funding: Major Program of the National Natural Science Foundation of China (60736016).
Keywords: text classification; kNN; class imbalance; critical point of the text training set; weight adjustment; random re-sampling.

