Abstract
The traditional TF-IDF-based KNN algorithm for text classification considers neither the multimodal distribution of text feature values nor the computational cost of text similarity, which leads to poor classification performance. To address this problem, a KNN text classification algorithm based on search improvement (SIKNN) is proposed. The algorithm computes the average similarity between the test sample and the samples in each category after clustering. When the category of the test sample can be determined easily, the algorithm stops comparing the test sample with the samples in the other categories, which shrinks the search space of the text similarity computation and speeds up text classification. The algorithm was compared with the traditional KNN algorithm and an improved KNN algorithm on the 20-Newsgroups data set. Experimental results show that the proposed algorithm significantly improves both the classification performance and the classification speed of the KNN algorithm.
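The search idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the clustering method, the order in which clusters are visited, or the stopping rule, so the fixed `stop_threshold`, the fallback majority vote, and all function names below are assumptions made for illustration only. Clusters are assumed to be precomputed and passed in as `(label, vectors)` pairs.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Smoothed inverse document frequency for a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    """Sparse TF-IDF vector (term -> weight); terms unseen in training are ignored."""
    counts = Counter(term for term in doc if term in idf)
    total = sum(counts.values())
    return {t: c / total * idf[t] for t, c in counts.items()} if total else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def classify_with_early_stop(query, clusters, k=3, stop_threshold=0.7):
    """Return (label, number_of_similarity_computations).

    clusters: list of (label, [vectors]) pairs, assumed to come from a prior
    clustering step. stop_threshold is a hypothetical confidence cut-off.
    """
    scored = []
    comparisons = 0
    for label, members in clusters:
        if not members:
            continue
        sims = [cosine(query, m) for m in members]
        comparisons += len(sims)
        # Assumed stopping rule: a high average similarity settles the label,
        # so the remaining clusters are never compared.
        if sum(sims) / len(sims) >= stop_threshold:
            return label, comparisons
        scored.extend((s, label) for s in sims)
    # No cluster was confident enough: fall back to an ordinary KNN vote.
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter(label for _, label in top)
    return votes.most_common(1)[0][0], comparisons


# Toy corpus; one cluster per category for brevity.
train = [
    (["gpu", "driver", "kernel"], "comp"),
    (["gpu", "kernel", "linux"], "comp"),
    (["goal", "match", "team"], "sport"),
    (["team", "coach", "match"], "sport"),
]
idf = fit_idf([doc for doc, _ in train])
clusters = [
    ("comp", [tfidf(d, idf) for d, lab in train if lab == "comp"]),
    ("sport", [tfidf(d, idf) for d, lab in train if lab == "sport"]),
]
query = tfidf(["gpu", "kernel", "driver"], idf)
early = classify_with_early_stop(query, clusters, stop_threshold=0.7)
full = classify_with_early_stop(query, clusters, stop_threshold=0.99)
```

In this toy run both calls should agree on the label, while the early-stopping call computes fewer similarities; how large that saving is on a real corpus such as 20-Newsgroups depends on the clustering and on the stopping rule actually used in the paper.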
Authors
YIN Ya-bo
YANG Wen-zhong
YANG Hui-ting
XU Chao-ying
(School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China; School of Software, Xinjiang University, Urumqi 830046, China)
Source
Computer Engineering and Design (《计算机工程与设计》)
Peking University Core Journal
2018, Issue 9, pp. 2923-2928 (6 pages)
Funding
National Basic Research Program of China (973 Program) (2014CB340500)
National Natural Science Foundation of China (U1603115, 61262087)
Keywords
K-nearest neighbor
text classification
similarity
multimodal distribution
clustering