Improvement of density-based method for reducing training data in KNN text classification (基于密度的kNN分类器训练样本裁剪方法的改进)

Cited by: 13
Abstract: In text classification, the distribution of the training set directly affects the efficiency and accuracy of a k-nearest neighbor (kNN) classifier. An analysis of the density-based method for reducing the training samples of a kNN text classifier reveals two shortcomings. First, the uniformity reached after reduction is only uniformity in the sense of spherical regions of radius ε, not the ideal uniform state in which every pair of samples is separated by the same distance. Second, samples in low-density regions are left untreated, so many non-uniform regions remain after reduction. Two improvements are proposed to address these shortcomings: the reduction strategy is optimized so that the reduced training set comes closer to the ideal uniform state, and samples are supplemented in low-density regions. Experimental comparison shows that the improved method performs clearly better in both stability and accuracy.
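Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2010, No. 3, pp. 799-801, 817 (4 pages)
Funding: China Postdoctoral Science Foundation (20070420711); Natural Science Foundation of the Chongqing Science and Technology Commission (2007BB2372)
Keywords: text classification; k-nearest neighbor (kNN); fast classification; sample reduction; sample supplement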
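
The abstract describes reduction in terms of spherical regions of radius ε, and the idea can be illustrated with a minimal sketch. This is not the procedure from the paper, whose details are not given in this record. It is a simplified greedy variant written under stated assumptions: the names `reduce_dense_regions`, `eps` and `min_pts` are hypothetical, Euclidean distance stands in for whatever text similarity the authors use, and the low-density supplementation step is not covered.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reduce_dense_regions(samples, labels, eps, min_pts):
    """Greedy density-based reduction (illustrative assumption, not the paper's rule).

    For each sample that is still kept, collect the kept same-class
    neighbours inside its eps-ball. If there are more than min_pts of
    them, drop the surplus, nearest neighbours first, so that the
    remaining points are spread further apart.
    Returns the indices of the samples that are kept.
    """
    keep = [True] * len(samples)
    for i in range(len(samples)):
        if not keep[i]:
            continue
        neighbours = [
            j for j in range(len(samples))
            if j != i
            and keep[j]
            and labels[j] == labels[i]
            and euclidean(samples[i], samples[j]) <= eps
        ]
        # Nearest first: very close same-class points add little information.
        neighbours.sort(key=lambda j: euclidean(samples[i], samples[j]))
        excess = max(len(neighbours) - min_pts, 0)
        for j in neighbours[:excess]:
            keep[j] = False
    return [i for i in range(len(samples)) if keep[i]]


if __name__ == "__main__":
    # A tight cluster of class "a", one isolated "a" sample, and one "b" sample.
    points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
              (0.05, 0.05), (5.0, 5.0), (0.05, 0.0)]
    classes = ["a", "a", "a", "a", "a", "a", "b"]
    print(reduce_dense_regions(points, classes, eps=0.5, min_pts=2))
```

Samples from other classes and isolated samples are never removed here, which mirrors the abstract's point that reduction alone leaves low-density regions untouched.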
References (11)

  • 1 ZHANG Ning, JIA Ziyan, SHI Zhongzhi. Text classification using the KNN algorithm[J]. Computer Engineering (计算机工程), 2005, 31(8): 171-172.
  • 2 LI Yang, ZENG Haiquan, LIU Qinghua, HU Yunfa. Fast Web document classification based on kNN[J]. Mini-Micro Systems (小型微型计算机系统), 2004, 25(4): 725-729.
  • 3 WANG Yu, BAI Shi, WANG Zheng'ou. A fast KNN algorithm for Web text classification[J]. Journal of the China Society for Scientific and Technical Information (情报学报), 2007, 26(1): 60-64.
  • 4 RUIZ V E. An algorithm for finding nearest neighbors in (approximately) constant average time[J]. Pattern Recognition Letters, 1986, 4(3): 145-147.
  • 5 HART P E. The condensed nearest neighbor rule[J]. IEEE Transactions on Information Theory, 1968, IT-14(3): 515-516.
  • 6 WILSON D L. Asymptotic properties of nearest neighbor rules using edited data[J]. IEEE Transactions on Systems, Man and Cybernetics, 1972, 2(3): 408-421.
  • 7 DEVIJVER P, KITTLER J. Pattern recognition: A statistical approach[M]. Englewood Cliffs: Prentice Hall, 1982.
  • 8 KUNCHEVA L I. Fitness functions in editing KNN reference set by genetic algorithms[J]. Pattern Recognition, 1997, 30(6): 1041-1049.
  • 9 LI Ronglu, HU Yunfa. A density-based method for reducing the amount of training data in kNN text classification[J]. Journal of Computer Research and Development (计算机研究与发展), 2004, 41(4): 539-545.
  • 10 FANG YUAN, LIU YANG. A new density-based method for reducing the amount of training data in k-NN text classification[C]// Proceedings of the 6th International Conference on Machine Learning and Cybernetics. Hong Kong: [s.n.], 2007: 3372-3376.

