
An Improved KNN Algorithm Applied to Text Categorization
Cited by: 15
Abstract: The nearest-neighbor classifier assumes that the local class-conditional probabilities are constant, an assumption that breaks down in high-dimensional feature spaces. A k-nearest-neighbor (KNN) classifier applied in such a space therefore suffers severe bias unless the feature weights are corrected. This paper uses a sensitivity method in which a feedforward neural network provides initial feature weights and performs a second round of dimensionality reduction. Under the initial weights, the training samples are partitioned by pairwise similarity into small regions with an SS-tree, which is used to find an unclassified sample's approximate k0 nearest neighbors. New weights are then computed from these k0 approximate neighbors using the chi-square distance, and the final k nearest neighbors are searched for under the new weights. At a small additional time cost, the method yields a clear improvement in text-categorization accuracy.
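The final classification step described in the abstract can be sketched as a weighted KNN vote under the chi-square distance. This is a minimal illustration only: the function names are hypothetical, the feature weights `w` are taken as given (in the paper they come from the neural-network sensitivity method), and the SS-tree approximate-neighbor search is omitted in favor of a brute-force scan.

```python
import numpy as np

def chi_square_distance(x, y, w, eps=1e-12):
    """Weighted chi-square distance: sum_i w_i * (x_i - y_i)^2 / (x_i + y_i).

    eps guards against division by zero when both coordinates are 0.
    """
    return float(np.sum(w * (x - y) ** 2 / (x + y + eps)))

def knn_classify(query, X, labels, w, k):
    """Classify `query` by majority vote among its k nearest training
    samples under the weighted chi-square distance."""
    dists = np.array([chi_square_distance(query, xi, w) for xi in X])
    nearest = np.argsort(dists)[:k]
    votes = {}
    for i in nearest:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)

# Toy usage on normalized 2-dimensional term vectors:
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = ["sports", "sports", "finance", "finance"]
w = np.array([1.0, 1.0])
print(knn_classify(np.array([0.8, 0.2]), X, labels, w, k=3))
```

In the paper's scheme this vote would be run twice: once with the initial weights over the SS-tree's candidate region to get the approximate k0 neighbors, and once with the recomputed weights to get the final k neighbors.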
Source: Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University core journal, 2007, No. 3, pp. 76-82 (7 pages)
Funding: National Natural Science Foundation of China (grant No. 60275020)
Keywords: computer application; Chinese information processing; text categorization; neural network; chi-square distance; KNN algorithm

References (10)

  • 1 代六玲, 黄河燕, 陈肇雄. A comparative study of feature extraction methods for Chinese text categorization [J]. Journal of Chinese Information Processing, 2004, 18(1): 26-32.
  • 2 Carlotta Domeniconi, Jing Peng, Dimitrios Gunopulos. Locally Adaptive Metric Nearest-Neighbor Classification [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(9): 1281-1285.
  • 3 Jing Peng, Douglas R. Heisterkamp, H. K. Dai. LDA/SVM Driven Nearest Neighbor Classification [J]. IEEE Transactions on Neural Networks, 2003, 14(4): 940-942.
  • 4 王晓晔, 王正欧. An improved algorithm for K-nearest-neighbor classification [J]. Journal of Electronics & Information Technology, 2005, 27(3): 487-491.
  • 5 Setiono R, Liu H. Neural network feature selector [J]. IEEE Transactions on Neural Networks, 1997, 8(3): 654-662.
  • 6 David A. White, Ramesh Jain. Similarity indexing with the SS-tree [A]. In: Proceedings of the 12th International Conference on Data Engineering [C]. 1996: 516-523.
  • 7 Wettschereck D, Aha D W, Mohri T. A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms [J]. AI Review, 1997, 11(2): 273-314.
  • 8 T. Hastie, R. Tibshirani. Discriminant Adaptive Nearest Neighbor Classification [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1996, 18(6): 607-615.
  • 9 王煜, 王正欧. Extracting text classification rules based on a fuzzy decision tree [J]. Journal of Computer Applications, 2005, 25(7): 1634-1637.
  • 10 周水庚, 关佶红, 胡运发. Latent semantic indexing and its application to Chinese text processing [J]. Mini-Micro Systems, 2001, 22(2): 239-243.


