Abstract
Advances in information acquisition technology have produced high-dimensional, large-scale data, which poses a major challenge for data mining. To address the low efficiency and high time cost of the K-nearest neighbor (KNN) classification algorithm on high-dimensional data, a high-dimensional classification algorithm, K-nearest neighbor based on weight search tree (KNN-WST), is proposed. The algorithm selects a subset of attributes, according to the magnitude of their feature weights, as nodes for constructing a search tree, and the search tree partitions the data set into different matrix regions. An unknown sample traverses the search tree to find the most "similar" matrix region, and distances are computed only against the data in that region, which reduces the amount of data compared and thus the time complexity. The Minkowski distance best suited to distance measurement on high-dimensional data is also studied and discussed. Simulation experiments on 6 standard high-dimensional data sets show that, compared with the K-nearest neighbor classification algorithm, decision tree, and support vector machine (SVM), KNN-WST significantly reduces classification time while also achieving higher classification accuracy. The algorithm therefore performs better on high-dimensional classification and is expected to provide a reference for related high-dimensional data problems.
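The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how feature weights are computed or how tree nodes split the data, so this sketch assumes per-feature variance as a stand-in weight and a median threshold at each node, and it represents the tree implicitly by its root-to-leaf paths. The names `minkowski`, `build_regions`, and `classify` are hypothetical.

```python
import numpy as np
from collections import Counter


def minkowski(a, b, p=2.0):
    """Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


def build_regions(X, weights, depth):
    """Group training rows into regions using the `depth` heaviest features.

    Assumption: each selected feature splits at its median, so a region is
    identified by a tuple of booleans (one root-to-leaf path of the tree).
    """
    feats = np.argsort(weights)[::-1][:depth]      # highest weights first
    thresholds = np.median(X[:, feats], axis=0)    # one split value per node
    regions = {}
    for i, row in enumerate(X[:, feats] > thresholds):
        regions.setdefault(tuple(row.tolist()), []).append(i)
    return feats, thresholds, regions


def classify(x, X, y, feats, thresholds, regions, k=3, p=2.0):
    """Run KNN only against the rows in the region that x falls into."""
    key = tuple((x[feats] > thresholds).tolist())
    # Assumption: fall back to the whole training set if the region is empty.
    idx = regions.get(key) or list(range(len(X)))
    nearest = sorted((minkowski(x, X[i], p), i) for i in idx)[:k]
    votes = Counter(y[i] for _, i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two well-separated groups; the third feature carries no information.
X = np.array([[0.0, 0.0, 5.0], [0.1, 0.2, 5.0], [0.2, 0.1, 5.0],
              [10.0, 10.0, 5.0], [10.1, 10.2, 5.0], [10.2, 10.1, 5.0]])
y = [0, 0, 0, 1, 1, 1]
weights = X.var(axis=0)  # stand-in weights (assumption, not from the paper)
feats, thresholds, regions = build_regions(X, weights, depth=2)
label_a = classify(np.array([0.05, 0.05, 5.0]), X, y, feats, thresholds, regions)
label_b = classify(np.array([9.9, 9.9, 5.0]), X, y, feats, thresholds, regions)
```

With `depth` selected features the query is compared against one region instead of the full training set, which is where the time saving described in the abstract would come from. The order `p` of the Minkowski distance is left as a parameter because the abstract does not state which value the authors found most suitable.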
Authors
梁淑蓉
陈基漓
谢晓兰
LIANG Shu-rong; CHEN Ji-li; XIE Xiao-lan (College of Information Science and Engineering, Guilin University of Technology, Guilin 541004, China; Guangxi Key Laboratory of Embedded Technology and Intelligent Systems, Guilin 541004, China)
Source
《科学技术与工程》
Peking University Core Journal (北大核心)
2021, Issue 7, pp. 2760-2766 (7 pages)
Science Technology and Engineering
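Funding
National Natural Science Foundation of China (61762031)
Guangxi Science and Technology Major Project (桂科AA19046004)
Guangxi Key Research and Development Program (桂科AB18126006)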
Keywords
high-dimensional data
K-nearest neighbor classification
feature attribute
search tree
Minkowski distance