
KNN Text Classification Based on the Sample Importance Principle (cited by: 6)
Abstract KNN, one of the top ten data mining algorithms, performs well on text classification. The traditional KNN method gives every sample the same weight and ignores the fact that different samples contribute differently to classification. To address this problem, a sample importance principle is proposed and a KNN classifier is built on it. A random walk algorithm identifies the samples near class boundaries and computes a boundary value for each sample; an importance score for each sample is then derived from its boundary value. Combining sample importance with the KNN method yields a new classification model, SI-KNN. Experiments on Chinese and English text corpora show that the SI-KNN model achieves some improvement over the traditional KNN method.
Source Journal of Jiangxi Normal University (Natural Science Edition), indexed in CAS and the Peking University Core Journals list, 2015, No. 3, pp. 297-303, 314 (8 pages)
Funding Supported by the National Natural Science Foundation of China (61272212, 61163006, 61203313, 61365002, 61462045)
Keywords text classification; KNN; sample importance principle; SI-KNN
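
The pipeline summarized in the abstract (random walk, boundary value, importance score, weighted KNN vote) can be illustrated with a minimal sketch. This record does not give the paper's formulas, so everything specific here is an assumption made for illustration: the function names, the cosine-similarity neighbour graph, the boundary value defined as the probability that a short random walk ends on a sample of another class, and the mapping importance = 1 - boundary value. It should not be read as the authors' SI-KNN implementation.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def boundary_values(X, y, k=3, steps=2):
    """ASSUMPTION: the boundary value of sample i is the probability that a
    `steps`-step random walk on the k-nearest-neighbour graph, started at i,
    ends on a sample whose label differs from y[i].
    Assumes non-negative features (e.g. term weights)."""
    n = len(X)
    # Row-stochastic transition matrix over each sample's k most similar samples.
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sims = sorted(((cosine(X[i], X[j]), j) for j in range(n) if j != i),
                      reverse=True)[:k]
        total = sum(s for s, _ in sims)
        for s, j in sims:
            P[i][j] = s / total if total > 0 else 1.0 / len(sims)
    values = []
    for i in range(n):
        dist = [0.0] * n
        dist[i] = 1.0  # the walk starts at sample i
        for _ in range(steps):
            dist = [sum(dist[a] * P[a][b] for a in range(n)) for b in range(n)]
        values.append(sum(p for j, p in enumerate(dist) if y[j] != y[i]))
    return values


def importance_scores(boundary):
    """ASSUMPTION: samples far from the class boundary get larger weights."""
    return [1.0 - b for b in boundary]


def si_knn_predict(X, y, importance, query, k=3):
    """Weighted KNN vote: each neighbour contributes similarity * importance."""
    nearest = sorted(((cosine(query, x), i) for i, x in enumerate(X)),
                     reverse=True)[:k]
    votes = defaultdict(float)
    for s, i in nearest:
        votes[y[i]] += s * importance[i]
    return max(votes, key=votes.get)


# Toy data: two classes in a two-term feature space (hypothetical values).
X = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.0, 1.0], [0.3, 0.8]]
y = ["a", "a", "a", "b", "b", "b"]

boundary = boundary_values(X, y)
importance = importance_scores(boundary)
print(si_knn_predict(X, y, importance, [1.0, 0.05]))
```

Setting every importance score to 1.0 reduces the vote to ordinary similarity-weighted KNN, which makes the effect of the importance term easy to compare on a given corpus.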


